Automated voiceover mixing and components therefor

ABSTRACT

Voiceover mixing is provided by receiving a voiceover file and a music file. The voiceover file is audio processed to generate a processed voiceover file and the music file is audio processed to generate a processed music file. The processed voiceover file and the processed music file are weight summed to generate a weighted combination of the processed voiceover file and the processed music file. Single band compressing is performed on the weighted combination. A creative file that contains a compressed and weighted combination of the processed voiceover file and the processed music file is then generated.

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims priority to, and the benefit of, U.S. Provisional Patent Application Ser. No. 62/672,898, filed May 17, 2018, which is hereby incorporated by reference in its entirety.

TECHNICAL FIELD

Example aspects described herein relate generally to creative generation and creative trafficking systems, and in particular to a system, a method, an apparatus and a non-transitory computer-readable storage medium for automated voiceover mixing and components therefor.

DESCRIPTION OF RELATED ART

Existing solutions for generating and trafficking creatives involve processes that are variable and require different levels of effort and cost, as well as significant interaction through the use of several tools. Creative content providers would like to hedge their goals across different and new creative types and campaigns, but existing technology limits their ability to do so. Backend, foundational infrastructure for performing such functionality is lacking. One challenge in developing such an infrastructure lies in the lack of technology capable of generating creative content based on a minimal number of input signals.

There is a need for technology that provides the connections and interplay between the functional components through which data and content associated with different types of creatives can flow and be processed efficiently. Performing existing processes using conventional functional components and pipelines becomes a significant engineering challenge in view of failure modes, recovery options, retries, notifications and the like. In addition, significant engineering challenges have limited the extent to which the workflows in the pipeline can be automated.

Many types of audio content, such as advertisements (“ads”), radio shows, podcasts, or movie soundtracks, require a recording of a voice to be mixed with background music or a soundscape. The mix needs to be balanced so that the background is audible but does not overpower the voice. Existing voiceover mixing solutions require trained audio engineers to manually create mixes and masters. However, this manual process is time consuming, subjective and costly, making it near impossible to scale. Accordingly, there is a need for a voiceover mixing technological solution that automates the processes performed by the mixing engineer and allows for the scalable creation of audio creatives.

Finding media content (e.g., music that is both available for use in advertisements and fits a desired mood) is difficult. Often advertisers will know what they want the music to sound like and need a way to search through potentially large catalogs of available music.

Existing solutions such as those supplied by FREESOUND or MELODYLOOPS (www.freesound.org, www.melodyloops.com) provide a mechanism to search through a collection of content using metadata or semantic tags (e.g., “acoustic”, “corporate”). These technologies typically allow searching through the use of tag-based filtering. However, tag-based filtering limits the search to a specific set of pre-existing terms, and there is not always a universal perception of how media content should be categorized. For example, there is no universal perception of what “corporate” music sounds like. Metadata allows users to search through titles and artists, but if the catalog contains unfamiliar (e.g., music) content, this information is not meaningful to the user. The user may know what they like, but not how to describe it.

Solutions for measuring similarity are described in Dieleman, S., “Recommending music on Spotify with deep learning”, Spotify (2014). The methods provide technical solutions to the problem of predicting listening preferences from audio signals by training a regression model to predict the latent representations of songs that were obtained from a collaborative filtering model. While the methods described in Dieleman are useful for creating a deep neural network that can be used to create an n-dimensional vector for use with content-based recommendation systems, they do not provide a technique for comparing songs where listening data is unavailable.

Advertisers running campaigns in multiple locations create ads that are all the same except for a segment that is specific to the location (for example, concert tours). More specific levels of personalization, like saying the listener's name, are not feasible because of the amount of time required to produce all variations. Existing solutions require that ads be created manually. Existing solutions do not provide hyper-personalized ads. There is a need, therefore, for a technical solution that can personalize or localize creatives at scale.

BRIEF DESCRIPTION

In an example embodiment, a computer-implemented method for voiceover mixing is provided. The method includes receiving a voiceover file and a music file; audio processing the voiceover file to generate a processed voiceover file; audio processing the music file to generate a processed music file; weighted summing the processed voiceover file and the processed music file to generate a weighted combination of the processed voiceover file and the processed music file; single band compressing the weighted combination; and generating a creative file containing a compressed and weighted combination of the processed voiceover file and the processed music file.
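
The following non-limiting sketch illustrates this processing chain in Python. It assumes mono audio as NumPy sample arrays at a shared sample rate; the helper names, fixed weights, and static compressor parameters are hypothetical stand-ins, not part of the claimed method.

```python
import numpy as np

def normalize(x: np.ndarray, peak: float = 0.9) -> np.ndarray:
    """Scale audio so its maximum absolute sample equals `peak`."""
    m = np.max(np.abs(x))
    return x if m == 0 else x * (peak / m)

def single_band_compress(x: np.ndarray, threshold: float = 0.5,
                         ratio: float = 4.0) -> np.ndarray:
    """Static single-band compressor: samples above the threshold are
    attenuated by `ratio` (no attack/release modeling)."""
    out = x.copy()
    over = np.abs(out) > threshold
    out[over] = np.sign(out[over]) * (threshold + (np.abs(out[over]) - threshold) / ratio)
    return out

def mix_voiceover(voice: np.ndarray, music: np.ndarray,
                  voice_weight: float = 0.7,
                  music_weight: float = 0.3) -> np.ndarray:
    """Weighted sum of the processed files, then single band compression."""
    n = min(len(voice), len(music))               # align lengths
    processed_voice = normalize(voice[:n])        # stand-in for voice processing
    processed_music = normalize(music[:n])        # stand-in for music processing
    weighted = voice_weight * processed_voice + music_weight * processed_music
    return single_band_compress(weighted)         # result written to the creative file
```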

In some embodiments, the method further includes measuring the energy level of the voiceover file within a frequency range; and filtering the frequency range if the energy level exceeds a predetermined threshold.
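
A minimal sketch of this energy check, assuming SciPy is available; the band edges and threshold are hypothetical examples.

```python
import numpy as np
from scipy import signal

def filter_band_if_hot(x: np.ndarray, sr: int, lo: float = 5000.0,
                       hi: float = 8000.0, threshold: float = 0.1) -> np.ndarray:
    """Measure the fraction of energy within [lo, hi] Hz and notch the
    band out if it exceeds the predetermined threshold."""
    spectrum = np.abs(np.fft.rfft(x)) ** 2
    total = spectrum.sum()
    if total == 0:
        return x                                  # silent input; nothing to filter
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    band = (freqs >= lo) & (freqs <= hi)
    band_energy = spectrum[band].sum() / total
    if band_energy <= threshold:
        return x                                  # band is quiet enough; leave as-is
    sos = signal.butter(4, [lo, hi], btype='bandstop', fs=sr, output='sos')
    return signal.sosfiltfilt(sos, x)
```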

In some embodiments, the audio processing the voiceover file includes normalizing, compressing and equalizing the voiceover file, and the audio processing the music file includes normalizing, compressing and equalizing the music file. The voiceover file and the music file are normalized, compressed and equalized asynchronously.
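
One way to realize this asynchronous processing is sketched below with Python's asyncio; the coroutine bodies are hypothetical placeholders for the real normalize/compress/equalize work.

```python
import asyncio

async def process_voiceover(path: str) -> str:
    # normalize -> compress -> equalize the voiceover (placeholder)
    await asyncio.sleep(0)                        # stand-in for real async audio work
    return path + ".processed"

async def process_music(path: str) -> str:
    # normalize -> compress -> equalize the music (placeholder)
    await asyncio.sleep(0)
    return path + ".processed"

async def process_both(voice_path: str, music_path: str):
    # Neither branch waits on the other; both run concurrently.
    voice, music = await asyncio.gather(process_voiceover(voice_path),
                                        process_music(music_path))
    return voice, music

voice_out, music_out = asyncio.run(process_both("vo.wav", "music.wav"))
```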

In some embodiments, the method further includes storing, in a voice activations store, a curve corresponding to when a voice is present in the voiceover file.
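
The stored curve can be as simple as a framewise voice-activity signal. A sketch under that assumption (the frame size and threshold are illustrative):

```python
import numpy as np

def voice_activation_curve(x: np.ndarray, sr: int, frame_ms: float = 20.0,
                           threshold: float = 0.02) -> np.ndarray:
    """One value per frame: 1.0 where the frame's RMS suggests voice is
    present, 0.0 otherwise; this array is what a voice activations store
    would persist for the voiceover file."""
    frame = int(sr * frame_ms / 1000)
    n_frames = len(x) // frame
    frames = x[:n_frames * frame].reshape(n_frames, frame)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    return (rms > threshold).astype(float)
```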

In some embodiments, the method further includes setting an advertisement duration time; setting a start time for the voiceover file; trimming the music file according to the advertisement duration time; and mixing the voiceover file and the music file according to the start time and the advertisement duration time.
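
A sketch of the trimming and placement arithmetic, assuming mono NumPy arrays at one sample rate:

```python
import numpy as np

def place_voiceover(voice: np.ndarray, music: np.ndarray, sr: int,
                    ad_duration_s: float, voice_start_s: float) -> np.ndarray:
    """Trim the music to the advertisement duration, then add the
    voiceover starting at its start time."""
    ad_len = int(ad_duration_s * sr)
    mix = music[:ad_len].copy()                   # trim music to the ad length
    start = min(int(voice_start_s * sr), ad_len)  # keep the start inside the ad
    end = min(start + len(voice), ad_len)         # clip voice to the ad window
    mix[start:end] += voice[:end - start]
    return mix
```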

In some embodiments, the method further includes generating a script; converting the script to voice content; and saving the voice content in the voiceover file.

In yet other embodiments, the method further includes mapping each track in a library of tracks to a point in an embedding space; computing an acoustic embedding based on a query track within the embedding space; obtaining a track from the library of tracks with acoustically similar content; and saving the track from the library of tracks with acoustically similar content in the music file.
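
The lookup reduces to a nearest-neighbor query in the embedding space. A minimal sketch, assuming each library track has already been mapped to a row of an embedding matrix:

```python
import numpy as np

def most_similar_track(query_embedding: np.ndarray,
                       library_embeddings: np.ndarray,
                       track_ids: list) -> str:
    """Return the ID of the library track whose embedding is closest to
    the query track's embedding by cosine similarity."""
    q = query_embedding / np.linalg.norm(query_embedding)
    lib = library_embeddings / np.linalg.norm(library_embeddings,
                                              axis=1, keepdims=True)
    return track_ids[int(np.argmax(lib @ q))]
```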

In another example embodiment there is provided a system for voiceover mixing. The system includes a voice processor, a music processor and a mixing processor. The voice processor is operable to receive a voiceover file and generate a processed voiceover file from the voiceover file. The music processor is operable to receive a music file and generate a processed music file from the music file. The mixing processor is operable to weight sum the processed voiceover file and the processed music file to generate a weighted combination of the processed voiceover file and the processed music file, single band compress the weighted combination, and generate a creative file containing a compressed and weighted combination of the processed voiceover file and the processed music file.

In some embodiments, the voice processor is further operable to measure the energy level of the voiceover file within a frequency range, and filter the frequency range if the energy level exceeds a predetermined threshold.

In some embodiments, the voice processor is further operable to normalize, compress and equalize the voiceover file, and the music processor is further operable to normalize, compress and equalize the music file. The voiceover file and the music file are normalized, compressed and equalized asynchronously.

In some embodiments, the system for voiceover mixing further includes a voice activations store operable to store a curve corresponding to when a voice is present in the voiceover file.

In some embodiments, the system for voiceover mixing further includes an advertisement store operable to store an advertisement duration time. The voice processor is further operable to set a start time for the voiceover file, and the music processor is further operable to trim the music file according to the advertisement duration time. The mixing processor mixes the voiceover file and the music file according to the start time and the advertisement duration time.

In yet other embodiments, the system for voiceover mixing further includes a script processor, a text to voice processor and a voiceover store. The script processor is operable to generate a script from at least one script section. The text to voice processor is operable to convert the script to voice content. The voiceover store is configured to save the voice content in the voiceover file.

In some embodiments, the system for voiceover mixing further includes a background music search processor. The background music search processor is operable to: map each track in a library of tracks to a point in an embedding space; compute an acoustic embedding based on a query track within the embedding space; obtain a track from the library of tracks with acoustically similar content; and save the track from the library of tracks with acoustically similar content in the music file.

In yet another example embodiment, there is provided a non-transitory computer-readable medium having stored thereon one or more sequences of instructions for causing one or more processors to perform the voiceover mixing procedures described herein.

Another aspect of the present invention includes a computer-implemented call to action method. The method includes receiving an entity datapoint containing data related to an entity; receiving a campaign objective datapoint containing data associated with a campaign objective; receiving at least one definite script element based on the campaign objective; receiving entity metadata containing data associated with the entity; generating at least one variable script element based on the entity metadata; presenting to a device the at least one definite script element; and presenting to the device the at least one variable script element.
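
By way of illustration only, combining definite and variable script elements might look as follows; the metadata keys and phrasing are hypothetical.

```python
def build_script(definite_elements: list, entity_metadata: dict) -> list:
    """Fixed (definite) elements come from the campaign objective;
    variable elements are generated from the entity metadata."""
    variable_elements = []
    if "city" in entity_metadata:
        variable_elements.append(f"Live in {entity_metadata['city']}!")
    if "date" in entity_metadata:
        variable_elements.append(f"On {entity_metadata['date']}.")
    return definite_elements + variable_elements

# Both element types are then presented to the device in order.
script = build_script(["Get your tickets now."],
                      {"city": "Stockholm", "date": "June 12"})
```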

In some embodiments, the method further includes receiving a user datapoint containing data associated with a user of the device and generating at least one variable script element based on the user datapoint.

In some embodiments, the method further includes selecting one of a plurality of possible script elements to obtain a selected script element and communicating over a network the selected script element.

In some embodiments, the method further includes receiving over a network an information item from the device; determining whether the information item from the device meets a condition; presenting a first call to action script via the device if the information item meets the condition; and presenting a second call to action via the device if the information item does not meet the condition.

In some embodiments, the method further includes receiving an indication from a device whether a user of the device is in focus. If the user of the device is in focus, the method performs presenting a first call to action script element via the device. If the user of the device is not in focus, the method performs presenting a second call to action script element via the device.

In some embodiments, the method further includes determining whether a response has been received by the device. If no response has been received by the device, the method performs presenting via the device a no-response message indicating that no response has been received. If a valid response has been received by the device, the method performs presenting via the device a valid response message indicating that a response has been received. If an invalid response has been received by the device, the method performs presenting via the device an invalid response message and communicating another call to action script.
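
The three response outcomes amount to a small branch. A sketch, with hypothetical message strings:

```python
from typing import Callable, Optional

def handle_response(response: Optional[str],
                    is_valid: Callable[[str], bool],
                    next_cta: str) -> str:
    """Select the message presented via the device based on whether a
    response was received and, if so, whether it was valid."""
    if response is None:
        return "No response received."                 # no-response message
    if is_valid(response):
        return "Thanks, your response was received."   # valid response message
    # Invalid response: present an error and communicate another script.
    return f"Sorry, we didn't catch that. {next_cta}"

print(handle_response("yes", lambda r: r in {"yes", "no"}, "Say yes or no."))
```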

In some embodiments, the method further includes determining if the device receives a tap; performing a first operation if the device received the tap; and performing a second operation if the device did not receive the tap.

In another example embodiment, there is provided a system for performing call to action including a call to action processor operable to: receive an entity datapoint containing data related to an entity; receive a campaign objective datapoint containing data associated with a campaign objective; receive at least one definite script element based on the campaign objective; receive entity metadata containing data associated with the entity; generate at least one variable script element based on the entity metadata; present to a device the at least one definite script element; and present to the device the at least one variable script element.

In some embodiments, the call to action processor is further operable to receive a user datapoint containing data associated with a user of the device and generate at least one variable script element based on the user datapoint.

In some embodiments, the call to action processor is further operable to select one of a plurality of possible script elements to obtain a selected script element and communicate over a network the selected script element.

In some embodiments, the call to action processor is further operable to receive over a network an information item from the device; determine whether the information item from the device meets a condition; present a first call to action script via the device if the information item meets the condition; and present a second call to action via the device if the information item does not meet the condition.

In some embodiments, the call to action processor is further operable to receive an indication from a device whether a user of the device is in focus. If the user of the device is in focus, the call to action processor presents a first call to action script element via the device. If the user of the device is not in focus, the call to action processor presents a second call to action script element via the device.

In some embodiments, the call to action processor is further operable to determine whether a response has been received by the device. If no response has been received by the device, the call to action processor presents via the device a no-response message indicating that no response has been received. If a valid response has been received by the device, the call to action processor presents via the device a valid response message indicating that a response has been received. If an invalid response has been received by the device, the call to action processor presents via the device an invalid response message and communicates another call to action script.

In some embodiments, the call to action processor is further operable to determine if the device receives a tap; perform a first operation if the device received the tap; and perform a second operation if the device did not receive the tap.

In yet another example embodiment, there is provided a non-transitory computer-readable medium having stored thereon one or more sequences of instructions for causing one or more processors to perform the call to action procedures described herein.

BRIEF DESCRIPTION OF THE DRAWINGS

The features and advantages of the example embodiments of the invention presented herein will become more apparent from the detailed description set forth below when taken in conjunction with the following drawings.

FIG. 1 illustrates an example system for generating and trafficking creatives in accordance with an example aspect of the present invention.

FIG. 2 illustrates a block diagram of an exemplary creative development platform including the applications executed by a creative generator server and a creative trafficking server in accordance with an example aspect of the present invention.

FIG. 3A illustrates a graphical user interface in accordance with an example aspect of the present invention.

FIG. 3B illustrates a graphical user interface in accordance with an example aspect of the present invention.

FIG. 3C illustrates a graphical user interface that is used to render fields related to creative media content in accordance with an example aspect of the present invention.

FIG. 3D illustrates a graphical user interface that is used to render fields related to creative voiceover content in accordance with an example aspect of the present invention.

FIG. 4 depicts an example process for generating a creative in accordance with an example aspect of the present invention.

FIG. 5 illustrates an exemplary voiceover workflow definition for a voiceover approval process which can be executed with other workflows asynchronously in accordance with an example aspect of the present invention.

FIG. 6 illustrates another exemplary voiceover workflow definition for a voiceover approval process which can be executed by a voiceover request processor and mixer with other workflows asynchronously in accordance with an example aspect of the present invention.

FIG. 7 is a diagram illustrating a system for automating the generation of a creative in accordance with an example embodiment of the present invention.

FIG. 8 is a diagram illustrating a system for automating the generation of a creative in accordance with an example embodiment of the present invention.

FIG. 9 illustrates a process and embedding space in accordance with an aspect of the present invention.

FIG. 10 illustrates a diagram of a mixing system in accordance with an example aspect of the present invention.

FIG. 11 illustrates a dynamic call to action process in accordance with an example aspect of the present invention.

FIG. 12 illustrates a dynamic call to action process in accordance with an example aspect of the present invention.

FIG. 13 illustrates an example personalized spot, a generic spot and background music in accordance with an example aspect of the present invention.

FIG. 14 illustrates a delivered audio file that has been created in real-time in accordance with an example aspect of the present invention.

DESCRIPTION

FIG. 1 illustrates an example system for generating and trafficking creatives. Not all of the components are required to practice the invention, and variations in the arrangement and type of the components may be made without departing from the spirit or scope of the invention. As used herein, the term “component” is applied to describe a specific structure for performing specific associated functions, such as a special purpose computer programmed to perform algorithms (e.g., processes) disclosed herein. The component can take any of a variety of structural forms, including: instructions executable to perform algorithms to achieve a desired result, one or more processors (e.g., virtual or physical processors) executing instructions to perform algorithms to achieve a desired result, or one or more devices operating to perform algorithms to achieve a desired result. System 100 of FIG. 1 includes wide area networks/local area networks (“LANs/WANs”) (network) 102, wireless network(s) 104, client devices 106-1, 106-2, 106-3, 106-4, . . . , 106-n (referred to collectively and individually as client device 106), a creative generator server 108, a trafficking server 110, a media distribution server 112 and one or more external systems 114-1, 114-2, . . . , 114-n (referred to collectively and individually as an external system 114).

Wireless network 104 is configured to communicatively couple client devices 106 and their components with network 102. Wireless network 104 may include any of a variety of wireless sub-networks that may further overlay stand-alone ad-hoc networks, and the like, to provide an infrastructure-oriented connection for client devices 106. Such sub-networks may include mesh networks, wireless LAN (WLAN) networks, cellular networks, and the like. Other now or future known types of access points may be used in place of network 102 and wireless network 104.

Generally, the creative generator server 108 and trafficking server 110 cooperatively operate to generate and traffic creatives. In some examples, a creative is in the form of a media content item. For simplicity, as used herein, a creative media content item is sometimes simply referred to as a creative. Input specifying criteria for a creative is input via an input interface of an external system 114. In an example embodiment, the input is provided to external system 114 via a client device 106 (e.g., client device 106-4). In turn, the input is communicated to creative generator server 108 (via, e.g., WAN/LAN 102). Creative generator server 108 receives the input from the network (e.g., WAN/LAN 102) and executes creative generation applications asynchronously. Trafficking server 110 executes trafficking workflows asynchronously for the purpose of communicating the creatives generated by creative generator server 108 to targeted media-playback devices. Each creative is, in turn, communicated through network 102 to a client device 106 that has been targeted to receive the creative. The client device 106, in turn, plays the creative.

System 100 also includes a media object store 116 that stores media objects, a creative store 118 that stores creatives that have been generated by creative generator server 108, a user activity/demographics database 120 that stores user activity and demographic data, an interaction database 122 that stores activity profiles associated with accounts (e.g., of users), and a vector database 124 that stores vectors in accordance with the embodiments described herein.

In one example embodiment there is provided an automated creative development platform that performs asynchronous execution of creative generation workflows and trafficking workflows via a message queue. The platform includes creative platform components that operate according to custom workflow definitions to manage such creative generation and trafficking workflows during execution. A workflow definition represents a process and describes the tasks involved in the process. Workflow definitions can include properties, events, methods, protocols, indexers, and the like. A workflow can be defined for one specialized component. In some embodiments a workflow can be defined for more than one specialized component. A specialized component can have multiple workflow definitions; two such workflows can reflect two different processes the specialized component can perform. In some embodiments, a specialized component can be involved in more than one workflow at a time. In some embodiments, the workflows can operate asynchronously.
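
As a non-limiting illustration, a workflow definition can be modeled as a named, ordered task list. The Python structure below is a hypothetical sketch, not the platform's actual schema.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class WorkflowDefinition:
    """A named process plus the ordered tasks it involves. One
    specialized component may own several definitions, and one
    definition may span several components."""
    name: str
    tasks: List[Callable[[dict], dict]] = field(default_factory=list)

    def run(self, payload: dict) -> dict:
        for task in self.tasks:
            payload = task(payload)               # each task transforms the payload
        return payload

mixing = WorkflowDefinition("voiceover_mix",
                            tasks=[lambda p: {**p, "mixed": True}])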

The following non-limiting examples are described in terms of generating a creative that includes audio objects that have been previously stored in media object store 116. This description is not intended to limit the application of the example embodiments. In fact, after reading the following description, it will be apparent to one skilled in the relevant art(s) how to implement the following example embodiments in alternative embodiments. For example, by extending the platform to generate and traffic unique targeted creatives containing other types of media objects (e.g., video, text, etc.) in a variety of formats, and whether stored in media object store 116 or provided from a different source.

FIG. 2 illustrates a block diagram of an exemplary creative development platform 200 including the creative platform components executed by the creative generator server 108 (FIG. 1) and creative trafficking server 110 (FIG. 1). In an example embodiment, creative platform components include an audio generator 206, a voiceover request processor 208, a mixer 210, and a voiceover generation service 212. Creative platform components also can include a targeting processor 218, an audience generation service 220, and a content provider database 222. Creative platform components also can include a trafficking and performance tracking processor 214 and a creative distribution server 216. The features and advantages of the creative platform components presented herein will become more apparent from the detailed description set forth below when taken in conjunction with the respective drawings.

An input interface 202 contains definitions used to mediate the exchange of information between the creative platform components of creative development platform 200 as well as external systems 114 (FIG. 1) that can provide external sources of data (i.e., data that is external to creative development platform 200).

In some embodiments, input interface 202 provides a control configured to receive input data to modify the definitions. In some embodiments, the control can take the form of a user interface (UI) designed into a device with which a person may interact. This can include display screens, keyboards, and/or a mouse or other input device that allow a user to interact with the input interface 202 to modify the workflow definitions or applicable data. The modifications to the workflow definitions, in turn, generate modified workflow definitions that are used to generate one or more creatives having specified properties. In some embodiments, such modifications to the workflow definitions modify the traffic properties that define how the creative is trafficked. For example, input interface 202 can be configured to adjust input data through the use of an editor that receives input to vary the individual properties of the input data (e.g., data elements originally entered via input interface 202, such as tone, rhythm, etc.).

In one non-limiting example, input interface 202 can receive description information that contains data elements (e.g., attributes) describing a particular deliverable (e.g., targeted creative). The input is saved as one or more creative input objects containing data elements defining a particular deliverable.

In some embodiments, the input data that can be provided through input interface 202 includes, for example, background media content, a script for a voiceover, a tone of a voiceover, one or more targeting parameters, and one or more timing parameters. Examples of such information include a name of a song or track identifier (ID), voiceover script ID, emotional tone and rhythm, time(s) and date(s), images, and other metadata, correspondingly.

With reference to both FIGS. 1 and 2, in some embodiments, creative development platform 200 includes an application programming interface (API) 204 that processes the data provided from/to the interface 202. As shown in FIG. 2, API 204 is between the input interface 202 and various components of creative development platform 200 (e.g., servers and the functions those servers perform) that in conjunction are used to generate a creative containing media objects such as images, audio segments, and/or video clips, automatically.

The parameters of the input data are processed by the corresponding creative platform components of creative development platform 200. Different kinds of targeted requests, for example, have respective flows. In addition, different sequential steps are performed on the input data. Such creative platform components perform mixing, transcoding, sending emails, and the like. Together the creative platform components of creative development platform 200 generate a creative in the form of a targeted media content item.

Example aspects provide a definition of the workflow and the workers that perform the various steps within the workflow. Workflows are processed by workers, which are programs that interact with processors that coordinate work across components of the creative development platform 200 to get tasks, process them, and return their results. A worker implements an application processing step. In some embodiments, the workflows executed by the workers provide recovery mechanisms, retry mechanisms, and notification mechanisms.

Each function described above in connection with FIG. 2 is automated. Automation is used, for example, to create the parameters that are incorporated in the creative, to generate audio, and to control trafficking.

Each of the steps of a workflow performed by the various functions is performed asynchronously. As such, one function flow is not waiting for the result of another function flow. Once a series of steps is initiated, those steps are performed in the background by the workers. A view of the output (i.e., a view of a media object) is returned via an interface. Optionally, a view of the output is returned via an interface at each step. If necessary, a notification is issued (e.g., via an interface) requesting additional input. The individual workflows are performed asynchronously. Responses initiated within each flow (e.g., a notification or request for additional information) that are communicated through, for example, the interface are synchronous.

The example embodiments execute a number of workflows depending on the input they receive. For example, various types of input can be received through the interface. Depending on the type of input, a different workflow is performed. For example, if a media content item or a location of a media content item (e.g., a background track) is input, one workflow is performed. If no such input is received, then another workflow is performed, for example, one which either requests or otherwise obtains a different type of input.

In an example embodiment, logic determines, based on some combination of inputs, a particular flow that should be implemented. Each flow returns a result (e.g., a return value such as a Boolean value). If each step is successful (as defined by a predetermined measure of success), the worker returns a success message, and the manager for the entire flow or pipeline knows to step the media object (e.g., an audio advertisement to be transmitted) to its next successful state based on the workflow definition. If a failure during the flow occurs, the individual workflow can handle the failure mode itself. In some embodiments, the workflow may not be capable of resolving the failure mode but, according to a corresponding workflow definition, may be arranged to retry a sequence of steps. In other words, the workflow, the workflow definition and the type of error dictate the response and output. For example, if the cause of the failure mode is the workflow itself, the workflow definition may have a solution to the failure that caused the failure mode. In some embodiments, a first workflow may expect data from another component of the system and not receive it in a timely manner. In one non-limiting example, the first workflow can continue moving forward through its steps without waiting for the data to be prepared (e.g., by a second workflow), because the data needed by the first workflow is still being prepared by the second workflow and may take additional time to prepare.

In an example embodiment, each independent routine, e.g., waiting for a voiceover, generating a new voiceover project, mixing, and trafficking, is a worker in the pipeline manager. Every worker has a defined logic that it performs. A mixing worker, for example, calls scripts that perform certain functionality. If the mixing worker performs the scripts successfully, the mixing worker causes a mixed media object (e.g., audio advertisement) to be stored in memory so that it can, in turn, be accessed for other steps, and returns a message indicating that it executed its flow successfully. If, for example, the mixing worker performs a script that fails, then the mixing worker returns a message or value indicating that it has failed. The term “script” is used herein in the context of computer science and in the context of writing.

In the context of computer science, the term script refers to a list of commands that are executed by a certain program or scripting engine. Scripts may be used to automate processes on a component.

In the context of writing, a script is the letters or characters used in writing. A voiceover, for example, can be read from a script and may be spoken by someone who appears elsewhere in the production or by a specialist voice talent. In some embodiments, the voiceover is synthesized using, for example, a text to speech synthesizer that converts the script to speech.

Every worker also has its own definition of what is successful and what is a failure. In the case of a mixing worker, for example, if an internal process in the mixing stage has determined that an internal stage has failed (e.g., a voiceover is silent, indicating that the voiceover mixing has failed), then the mixing worker returns a message indicating that the mixing stage has failed.

Example embodiments described herein can also provide automated routing and failure handling (e.g., retries) and recovery handling (e.g., fallback). In addition, the embodiments allow the various functions to be modular and allow different workflows to be defined. If one worker fails, the logic for how it falls back depends on the type of failure. Each worker can thus be performed more than one time safely.

In an exemplary embodiment, the individual creative platform components may not be part of a sequential workflow. In other words, they do not know that they are going to flow at all; they just know that they might be called. This allows the manager to be untethered to any particular workflow.

The pipeline manager is given all of the workers and workflow definitions. Using the workflow definitions, the pipeline manager executes the workers in sequence and manages predefined successes and failures.
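
A minimal sketch of this manager/worker arrangement follows; the retry policy and the success criterion (a step that does not raise) are illustrative assumptions.

```python
class Worker:
    """Implements one processing step; reports success or failure by its
    own definition (here, simply whether the step raised)."""
    def __init__(self, name, step):
        self.name, self.step = name, step

    def run(self, payload):
        try:
            return True, self.step(payload)
        except Exception:
            return False, payload                 # manager decides what happens next

class PipelineManager:
    """Given workers and a workflow definition (an ordered list of worker
    names), steps the media object through its states with retries."""
    def __init__(self, workers, definition, retries=1):
        self.workers = {w.name: w for w in workers}
        self.definition, self.retries = definition, retries

    def execute(self, payload):
        for name in self.definition:
            for _ in range(self.retries + 1):
                ok, payload = self.workers[name].run(payload)
                if ok:
                    break                          # advance to the next state
            else:
                raise RuntimeError(f"{name} failed after retries")
        return payload
```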

Graphical User Interfaces for Receiving Creative-Related Datapoint Values

FIGS. 3A, 3B, 3C and 3D illustrate graphical user interfaces that can be provided (e.g., rendered on an interface of a device) by user input interface 202. As used herein, a graphical user interface is a form of user interface that allows users to interact with a computer or electronic device through graphical icons or visual indicators using items such as windows, icons, command links, radio buttons, check boxes, text boxes, and menus. In some embodiments, a graphical user interface presents such items to obtain datapoint values. In turn, the datapoint values obtained via the graphical user interfaces are used to generate and traffic creatives.

The user interfaces depicted in FIGS. 3A, 3B, 3C and 3D can be used to provide selectable or fillable fields to obtain datapoint values (also referred to as input data, signals or simply datapoints). In turn, the datapoint values are processed by creative development platform 200 to generate and traffic creatives.

In some embodiments, at least some of the datapoint values are obtained through other mechanisms (e.g., a push or pull data flow model). In some embodiments, API 204 (FIG. 2) provides a set of functions allowing the other applications of creative development platform 200 to access the data. For example, API 204 can provide file input/output functions that cause a file to be copied from one location to another without requiring any user input.

It should be understood that the fields shown in FIGS. 3A, 3B, 3C and 3D are exemplary. Fewer, more, or different fields can be used to generate and traffic a creative.

Referring to FIG. 1, a creative (also sometimes referred to herein interchangeably as “targeted media content”) is trafficked over a network (e.g., 102, 104) to targeted devices such as client devices 106.

One example use case involves an external system 114 in communication with creative development platform 200. In this example, the external system 114 is a promoter system communicating a request for a creative. In response to the request, creative development platform 200 obtains one or more specific datapoint values corresponding to an event.

In other embodiments, the datapoint values can be obtained from a website or other database (e.g., of the external system 114). These values are, in turn, used to populate corresponding fields requested by interface 202. The website and database can include structured data, unstructured data, or a combination of both. For example, required information can be obtained using data scraping techniques. For instance, if a promoter system (e.g., external system 114) requests a creative for a particular concert, input interface 202 supplies input fields corresponding to elements of a record. In turn, signals (also referred to interchangeably as datapoint values or parameters) such as a date of a concert, a band name, band artists, images or other media content related to the artists, demographic information about the artist or artist fans, or the like, are retrieved from one or more external systems 114 (e.g., a website or database) via API 204. Creative generator server 108 populates the input fields of the record with the datapoint values automatically. Additional attributes related to the event (e.g., band or particular concert) can be retrieved from plural independent external systems 114 (e.g., databases and/or websites).

Any remaining fields necessary for creative development platform 200 to generate a creative can be input through a graphical user interface (GUI) via a client device 106.

FIG. 3A illustrates an example graphical user interface 300A that renders an advertisement (“ad”) objective section 304 and an ad name section 306. The ad objective section 304 provides campaign objective fields for obtaining input data corresponding to campaign (e.g., advertising, or promotional) objectives.

It should be understood that an ad objective is a type of campaign objective. Accordingly, other campaign objective types can be used in place of an ad objective and still be within the scope of the invention. Campaign objectives are the goals of advertising or promotional messages. Campaign objectives are used to craft messages, define target audiences and measure results. Example campaign objective types include:

-   Sell: to directly sell a product or service.
-   Demand Generation: to generate demand for an existing product without directly selling it with the ad.
-   Lead Generation: to identify leads for sales processes.
-   Engage Target Market: to engage potential customers with information, entertainment and participation with a brand.
-   Engage Customers: to engage existing customers to improve loyalty and customer lifetime value.
-   Engaging Influencers: to engage a group that has influence over a product.
-   Persuade: to persuade audiences about a topic or issue.
-   Reputation: to build a positive reputation for a firm, brand or product in the eyes of stakeholders.
-   Inform: to inform customers about products.
-   Market Research: to collect information for purposes such as strategy and product development.
-   Brand Awareness: to increase the number of customers who recognize a brand and associate it with a product category and qualities such as taste or durability.

In the example user interface depicted in FIG. 3A, one campaign objective (“ad objective”) field corresponds to selecting a campaign objective relating to promoting a brand, a business, and/or an organization. The other campaign objective field corresponds to selecting a campaign objective relating to promoting a concert or music-related content.

The creative that is generated is based on the type of campaign objective that is selected via the campaign objective (“ad objective”) section 304. Graphical user interface 300A also includes an ad name section 306. Ad name section 306 provides fields that can be used to receive ad name information.

FIG. 3B illustrates an example graphical user interface 300B that renders a demographic audience section 310, a listening behavior section 312, and a budget and schedule section 314. The demographic audience section 310 provides fields for obtaining demographics datapoint values relating to one or more groups that a content provider wishes to target. As shown in FIG. 3B, demographic audience section 310 presents fields for obtaining locations data, age data and gender data. The listening behavior section 312 presents fields for obtaining genre information and device operating system platform datapoint values. Budget and schedule section 314 provides fields related to start-end dates/times and total budget. The information collected via the demographic audience section 310, the listening behavior section 312 and the budget and schedule section 314 is used to determine, for example, how often a creative is distributed.

In an example embodiment, the above data can be stored in user activity/demographics database 120 (FIG. 1).

Optionally, the budget and schedule section 314 of graphical user interface 300B includes a payment method, or a link or portal to effect payment.

Optionally, an input data summary display window 316 is provided. In an example embodiment, the input data summary display window 316 displays a summary of the locations of targeted listeners (e.g., by country, region, state and/or designated market area (DMA)), as well as age range, gender, and/or platform. Also included is summary information associated with the cost of the advertisements, active date range and the like.

FIG. 3C illustrates a graphical user interface 300C that is used to render fields related to creative media content in accordance with an example aspect of the present invention. In some embodiments, the graphical user interface 300C includes an audio creative section 318 and a display creative section 320. As shown in FIG. 3C, an audio tab 317 allows an operator to select an option to upload an audio file as an audio creative. As described below in connection with FIG. 3D, a voiceover request tab 323 allows an operator to select an option to generate a voiceover file as the audio creative. The voiceover file can, in turn, be stored (e.g., in a store, such as creative store 118, or other store). For convenience, a store that stores a voiceover is referred to herein as a voiceover store. Similarly, a store that stores a music file is referred to herein as a music file store.

The audio creative section 318 and display creative section 320 are used to render fields related to desired media content components of a creative. In an example embodiment, audio creative section 318 of the graphical user interface 300C provides a mechanism for uploading one or more audio files, image files and/or video files. For example, a desired audio creative can include an uploaded audio file.

FIG. 3C also illustrates a graphical user interface 300C that can be used to render fields related to a display creative. As shown in FIG. 3C, in the display creative section 320, a companion image, a headline, and a click URL (uniform resource locator) can be input. Yet another section of the graphical user interface 300C can be constructed to provide an advertisement display preview 322 for both mobile devices and desktop computers.

In some embodiments the audio creative section can include an option to automatically select an audio file. An example implementation of an automated search for ad background music is described below in connection with FIG. 9. The creative development platform 200, for example, can be configured to obtain a media content item that is acoustically similar to a query track.

FIG. 3D illustrates a graphical user interface 300D that is used to render fields related to a creative voiceover in accordance with an example aspect of the present invention. In some embodiments, the creative can include a voiceover. As shown in FIG. 3D, a voiceover request tab 323 allows an operator to select an option to generate a script for a voiceover. In some embodiments, the voiceover can be mixed with an audio file discussed above in connection with FIG. 3C, as described below in more detail in connection with FIG. 10. Interface 300D includes a voiceover title section 324, a script section 326, a voiceover instruction section 328, a language section 330, a voice profile section 332, a background track section 334, and a display creative section 336.

Voiceover input data gives an operator the ability to write a script to be used as a voiceover. Voiceover title section 324 provides an input field that receives a title of the voiceover. Script section 326 provides script input fields that are used to obtain a script to be read by a voiceover mechanism. The voiceover mechanism can be a technological voiceover mechanism such as a text to speech audio mechanism. In some embodiments, the input that is received by script section 326 is communicated over a network to another system that presents the script to a voiceover actor who reads the script according to the parameters input through the user interfaces described in connection with FIGS. 3A, 3B, 3C and 3D. In some embodiments, voiceover script input fields of script section 326 include a pace for the voiceover script to be spoken. As shown in FIG. 3D, the language input field of language section 330 is a pulldown menu which allows an operator to select the particular language to be used. Voice profile section 332 allows voiceover profiles to be presaved. A background track can also be uploaded or selected through background track section 334. Display creative section 336 includes a headline field and click URL field. The headline field and click URL field are used to receive input data related to a companion image, a headline, and a click URL.

In some embodiments the audio creative section can include an option to automatically select a voiceover file. An example implementation of automated voiceover generation is described below in connection with FIGS. 11 and 12. The creative development platform 200, for example, can be configured to dynamically generate a voiceover or portions thereof.

In other embodiments, these media content components operate as seed components that creative development platform 200 uses to select other similar or otherwise more appropriate components to be included in the creative that is generated. In other embodiments, these media content components are images, audio or video content that correspond to the artist, event, band, or the like, that can be used to provide signals sufficient for creative development platform 200 to generate a creative. For example, if an image of an artist is uploaded, creative development platform 200 can be used to search external systems 114 for any information about the artist, such as the dates of future concerts. In turn, creative development platform 200 can perform natural language processing and execute natural language understanding algorithms to determine other signals that can be used to automatically generate a creative. Such media content can be stored, for example, in media object store 116 (FIG. 1).

Example Creative Generation Processes

FIG. 4 depicts an example process executed by one or more processors of creative development platform 200 for generating a creative in accordance with an example aspect of the present invention. Initially, at block 402, the creative generator server 108 of FIG. 1 receives datapoint values (as noted above, also referred to as input data or signals) that contain information used to generate and traffic the creative. In one embodiment, an identifier associated with a promoter system, a list of dates, and/or a value corresponding to a budget for a creative, which are used to initiate the creative generation and trafficking process, are provided by these signals. For convenience, this data is collectively referred to as initial creative parameter values. These initial creative parameter values are used to obtain any additional signals necessary to generate and traffic targeted media content.

In block 404, a determination is made as to the type of creative that should be generated: e.g., an audio, video or text creative. In the following example, an audio-based creative is generated. This determination can be made, for example, based on the playback capabilities of the particular client devices 106 that will receive the creative. For example, a vehicle appliance may only receive audio, whereas a mobile phone can receive audio and visual content.

In block 406, a first media object (e.g., in the form of a media file) or a preexisting media object is obtained and uploaded through interface 202.

In one embodiment, a determination is made as to whether the first media object for the creative should be generated based on a pre-existing media object described above (also referred to as a first preexisting media object) or whether a different media object (also referred to as a first new media object) should be generated.

The first media object can be obtained automatically based on predefined criteria, by comparing its metadata to one or more signals received through input interface 202 and selecting the best match. Now-known or future-developed mechanisms for selecting the best match can be used.

In turn, in block 408, the first media object is edited based on, for example, specific targeting and scheduling criteria. Depending on the target device, additional text can be inserted. For example, if a device can receive feedback through sensors (e.g., accelerometer, microphone, and the like), then the script may be edited to receive a response from the device. In some embodiments, depending on the action taken, the script can dynamically change. It should be understood that, as used herein, a script for a voiceover can be composed of several script elements.

A determination is made in block 410 as to whether an additional media object should be overlaid on top of the media content in the first media object. If so, in block 412, an additional media object is obtained. Particularly, in block 412, a determination is made as to whether the additional media object should be a preexisting media object (also referred to as an additional preexisting media object) or a different media object (also referred to as an additional new media object).

In one example use case, the first (preexisting or new) media object is in the form of an audio file and the additional (preexisting or new) media object is in the form of a voiceover audio file. The first media object and additional media object are processed so the additional media object content is overlaid on top of the first media object content (e.g., voiceover content is overlaid on top of audio content such as a music track), as shown in block 414. Additional editing is performed if necessary as well.

In one embodiment, the additional editing is performed automatically.

In another embodiment, the additional editing is performed partially automatically.

In yet another embodiment, the editing is performed manually through input interface 202 of the creative development platform 200.

If a determination is made at block 410 that the additional media object (e.g., such as a voiceover) should be created, then creative development platform 200 determines an additional media object name for the additional media file (e.g., the name of the voiceover). In an example embodiment, this name will also be reflected as a project name and a campaign name. In the case where the additional media file is a voiceover, platform 200 receives a script for the voiceover. The script can be text-to-speech translated by a processor (e.g., a text-to-speech processor). Optionally, platform 200 translates, using a translation module (not shown), the voiceover to one or more languages based on corresponding signals, e.g., the concert is in France, therefore the language is French. If any of the signals received through interface 202 indicate the content of the additional media file should be generated or manipulated a certain way, then the additional media file is processed accordingly, e.g., such that certain terms are stated with inflection or emphasis points, tone, or other information. In an optional embodiment, the signals received by input interface 202 provide sufficient information to determine the demographic attributes of the additional media object, e.g., the language or general tone of the voiceover.

As described below in connection with FIGS. 11 and 12, the script for the voiceover can be generated dynamically. Thus, in some embodiments, instead of obtaining a voiceover file, voiceover script sections are combined dynamically.

In some example embodiments, a method, a system, an apparatus and a computer-readable medium are provided for analyzing previously-consumed creatives to generate a model that can be used to generate or otherwise configure the attributes of a creative (e.g., the audio file, the voiceover file, the companion image, etc.). In an example embodiment, previously-consumed creatives are analyzed to determine which attributes of the creatives are most effective in, for example, driving action.

Attribute categories include objective, tone, music, assets, brand visibility, creative metadata, call-to-action categories, and the like. The objective may be what the new creative is targeted to. The tone may be the sound with respect to its pitch, quality and strength. Music may be the audio content that is included in the creative. Assets may be certain content that may be included in the creative, such as a voiceover script. Brand visibility may be how visible a brand is in the creative. Creative metadata may include various information about the creative. A call-to-action may be information included in the creative that requests an action to be performed by the user. The various attribute categories can be broken down into additional attributes.

The attributes are, in turn, fed to a processor which executes an algorithm that causes the processor to generate a model that is used to generate new creatives.

An analysis module 418 can be used to process previously-consumed creatives (e.g., creatives that have been consumed during a certain time period). In one example implementation, analysis module 418 identifies attributes in the creatives by using automatic identification processes, such as natural language processing (NLP), audio processing tools, and video processing tools that analyze the speech content and audio attributes of a creative. NLP and audio processing tools can be used, for example, to recognize the speech in a previously-consumed creative and to recognize certain phrases, artists, tone attributes, and the like. Visual recognition, text recognition, audio recognition and the like also may be used to determine or infer the attributes of the previously-consumed creatives. The attributes obtained using these techniques can be input into, for example, a table in a database.

Analysis module 418 can also be used to determine to what extent the attributes of previously-consumed creatives had an effect on consumers of the previously-consumed creatives. Analysis module 418 may input the detected information into a machine-learning algorithm that is used to train a model that predicts attributes of creatives that correspond to particular signal(s).

In one example use case, a particular signal may indicate the target consumer is over a certain age or a member of a certain demographic. A particular phrase or script that has been predicted to be most effective for this age group or demographic (e.g., that will translate to calls-to-action) will be obtained and used to create a new creative. In other words, the analysis module 418 predicts the effectiveness. Effectiveness may be measured by a quantifiable measure, for example, a click-through rate, sell-through rate, a referral rate, brand recall, or some combination of these or other measures. For example, it may be determined that a first script is most effective for a first type of concert promotion while a second script is more effective for a second type of concert promotion.

The analysis module 418 can thus build a model (also referred to as a machine-learning model) that is used to predict the attributes of a new creative.

A database may also be used to store measured statistics for the previously-consumed creatives, such as demographics statistics as shown in FIG. 1 (user activity/demographics DB 120). These demographics statistics relate a creative to the audience that might be relevant for the creative. For example, classical music concert promoters may be interested in listeners over a predetermined age. Game company promoters are interested in gamers.

In some embodiments, both background music and the words a voiceover mechanism (or artist) is speaking are provided automatically, and the audio levels are set when mixing the two. The machine automates the processes typically performed by the mixing engineer, allowing for the scalable creation of creatives containing audio. In some examples, given a voiceover audio file (e.g., a first media object) and a separate background music file (e.g., a second media object), an algorithm is executed by an audio generator 206, a voiceover request processor 208, a mixer 210, and a voiceover generation service 212 that collectively generate a voiceover mixed with background music in an automated fashion. This takes into account music lead-in time, volume normalizing, and balance between voiceover and background music. Parameters of the processing chain are estimated from the audio content, including the equalization parameters (estimated using the audio's frequency content) and the music lead-in time (using estimates of the background music's onset patterns).

Voiceover Workflow Definitions for Voiceover Approval Process

FIG. 5 illustrates an exemplary voiceover workflow definition for a voiceover approval process which can be executed with other workflows asynchronously. Referring again to FIG. 2, this process can be performed automatically by, for example, voiceover request processor 208. In block 502, the process waits for a voiceover. Once the voiceover is received, in block 504, the voiceover is reviewed and a determination is made as to whether the voiceover is approved or rejected. If a determination is made in block 504 that the voiceover is rejected, a new voiceover project is generated as shown in block 506. If a determination is made in block 504 that the voiceover is approved, then in block 508 the voiceover is mixed by mixer 210 and in block 510 trafficking and performance tracking processor 214 and creative distribution server 216 traffic the voiceover to targeted devices (e.g., client devices 106) on a network such as wireless network 104.

FIG. 6 illustrates another exemplary voiceover workflow definition for a voiceover approval process which can be executed by voiceover request processor 208 and mixer 210 with other workflows asynchronously. In block 602, the process waits for a voiceover. Once the voiceover is received, in block 604 the voiceover is queued for review. At block 606 the voiceover is mixed and a preview creative is generated. A review of the preview creative is performed at block 608 and a determination is made as to whether the preview creative is rejected or approved. If rejected, then a determination is made at block 610 as to the reason for the rejection. For example, if the reason is that the voiceover does not meet a particular criterion, thereby indicating the voiceover is bad, then at block 612 the voiceover request is regenerated and the flow returns to block 602. If the determination is made at block 610 that the mixing process does not meet a predetermined criterion, then this mix failure is logged and a message is communicated to the appropriate component associated with the project indicating this, as shown in block 614. For example, a message providing the log can be transmitted to the party responsible for the project. If a determination is made at block 608 that the preview creative is approved, then the preview creative is approved by an approval process, as shown in block 616. Once approved, at block 618 the final mix is trafficked, by creative distribution server 216 of FIG. 2, for example, at the direction of the trafficking and performance tracking processor 214.

Content can be stored in content provider database 222. As will be described below in more detail, a targeting processor 218 operates to determine target audiences. In some embodiments, the targeting processor 218 operates in conjunction with an audience generation service 220, which in turn is supplied content provided by a content provider whose content is stored in content provider DB 222.

Block 504 of FIG. 5 and block 616 of FIG. 6 will now be described in more detail with reference to FIG. 2. Audio data that includes speech may be transcribed by a voice transcriber which operates under the control of the voiceover generation service 212 of FIG. 2 using a language model. The transcription may be provided to a voiceover review processor (not shown) which operates under the control of the voiceover generation service 212 of FIG. 2. In turn, the voiceover review processor may provide feedback on the transcription. In some embodiments, the language model may be updated based at least in part on the feedback. The feedback from the voiceover review processor may include, for example, an affirmation of the transcription; a disapproval of the transcription; a correction to the transcription; a selection of an alternate transcription result; or any other kind of response.

An automated grammar generator (not shown), also under the control of the voiceover generation service 212 of FIG. 2, can be used to correct, revise or replace the proposed voiceover. In some embodiments, the automated grammar generator identifies one or more parts of the voiceover suitable for processing into a natural language expression. The natural language expression is an expression which a person might use to refer to the segment. The automatic grammar generator generates one or more phrases from the segment, each of the one or more phrases corresponding to, or capable of being processed into, a natural language expression or utterance suitable for referencing the text or speech segment. Noun phrases, verb phrases and other syntactic structures are identified in the speech or text segment, and modified to produce typical natural language expressions or utterances a user might employ to reference a segment. Verbs in verb phrases may be modified in order to provide further natural language expressions or utterances for use in the grammar. The natural language expressions thus generated may be included in grammars or language models to produce models for recognition using an automatic speech recognizer in a spoken language interface.

Search for Ad Background Music by Track

In some embodiments, a determination is made as to which media object from a library of media objects is used for a creative. In an example embodiment, the workflows are defined by audio generator 206 of FIG. 2. In one example embodiment, an interface (e.g., input interface 202 of FIG. 2) receives a query that, in turn, causes a search engine to search a library. The search engine can be contained within audio generator 206 or communicatively coupled to audio generator 206 via, for example, input interface 202 and/or API 204.

The library can be searched by, for example, using a query song as explained below.

FIG. 9 illustrates a process and embedding space in accordance with an aspect of the present invention. Generally, the query process is accomplished by using acoustic embeddings. Acoustic embeddings are derived directly from audio content.

The acoustic embeddings are used to map each track in a library of tracks to a point in an embedding space. In the example embodiment shown in FIG. 9, the acoustic embedding space 908 is derived directly from the audio content attributes of a library of tracks 910 (e.g., background music). Specifically, acoustic embedding of the audio content of the library of tracks is performed to map each track in the library of tracks to points in the embedding space 908 based on plural attributes of the track, as represented by block 912. An acoustic embedding is also computed for a query track within the embedding space.

N-tracks from the library of tracks that are nearest in the embedding space are determined and, in some embodiments, ranked by distance to the query track.

With reference to both FIGS. 2 and 9, input data defining one or more properties is received, for example, through input interface 202 of FIG. 2, and used to obtain acoustically similar media content that is, in turn, used for a creative, for example by mixing the media content with a voiceover. In this example, the input data received through input interface 202 is a query track 902. This input data can be, for example, in the form of a name of a song or track identifier (ID). The input data is used to obtain the features of the query track. The features of the query track, in turn, are used to generate an acoustic embedding of the query track 904. The acoustic embedding of the query track is mapped to a point 906 (also referred to herein as a query track embedding space point) in the embedding space 908 of the library of tracks.

Acoustically similar tracks 914, particularly embedding points representing tracks from the library of tracks 910 (e.g., N-tracks from the library of tracks 910, where N is an integer) that are nearest in the embedding space to the point within the embedding space representing the query track, are, in turn, returned as represented by block 916. The returned tracks can be ranked by distance to the query track. The returned tracks can be returned to other workflows within system 200 or to an external system, e.g., via interface 202.
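
For illustration only, below is a minimal sketch of this nearest-neighbor lookup, assuming the library embeddings have already been computed and stored as rows of a matrix; the names nearest_tracks, library_embeddings and track_ids are hypothetical, and Euclidean distance is assumed as the ranking metric.

```python
import numpy as np

def nearest_tracks(query_embedding: np.ndarray, library_embeddings: np.ndarray,
                   track_ids: list, n: int = 10) -> list:
    """Return the N library tracks nearest the query point, ranked by distance."""
    # Distance from the query track's point to every library point.
    distances = np.linalg.norm(library_embeddings - query_embedding, axis=1)
    order = np.argsort(distances)[:n]
    return [(track_ids[i], float(distances[i])) for i in order]
```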

In one embodiment, a constant-Q transform is performed on the query track 904 to generate a time-frequency representation of the audio content of the query track 904. Next, a learned convolution function is performed on the resulting constant-Q transform to project the constant-Q transform into a smaller space. The weights and convolutions are learned so as to place tracks close together in the space when an attribute of one track is the same as the corresponding attribute of another track from the library of tracks 910, and further apart when the attributes are different.

In one example embodiment, principal component analysis (PCA) is used to convert the 1024-dimensional vector into a set of values of linearly uncorrelated variables called principal components (or sometimes, principal modes of variation). The number of principal components is less than or equal to the smaller of the number of original variables or the number of observations. In this case, the 1024-dimensional vector is mapped to an 8-dimensional vector.
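
A hedged sketch of this dimensionality-reduction step follows, using scikit-learn's PCA on stand-in random vectors; the real embeddings, and any whitening or other preprocessing, are assumptions not specified in the text.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
track_vectors = rng.normal(size=(5000, 1024))   # stand-in for real track embeddings

pca = PCA(n_components=8)                       # 1024 dimensions down to 8
reduced = pca.fit_transform(track_vectors)
print(reduced.shape)                            # (5000, 8)
```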

In one example embodiment, a 1024-dimensional vector is generated for every song in a database. The 1024-dimensional vector is multiplied by a convolutional matrix that recombines the elements that are similar (i.e., elements that have a high covariance are preserved).

In another embodiment, a portion of a track can be used as the query input. For example, a section of a track can be provided through input interface 202 instead of the entire track (or a pointer to the entire track, such as a track ID).

The above-described mechanism for searching for ad background music by a track is performed by one or more processors referred to herein as a background music search processor. Particularly, when the functions described above are performed by the background music search processor, the background music search processor performs the methods described herein related to searching for ad background music.

Automated Ad Voiceover Mixing

Another aspect of the present invention relates to systems, methods and computer program products that automate the processes typically performed by a mixing engineer, thereby allowing for the scalable creation of audio ads. With reference to FIG. 2, the components and processes that will now be described can be included in, for example, audio generator 206, voiceover request processor 208, mixer 210 and/or voiceover generation service 212.

Generally, given a voiceover audio file and a separate background music file, an algorithm executed by at least one processor causes the processor(s) to mix the voiceover with the background music in an automated fashion. This takes into account music lead-in time, volume normalizing, and balance between voiceover and background music. Parameters of the processing chain are estimated from the audio content, including the equalization parameters (estimated using the audio's frequency content) and the music lead-in time (using estimates of the onset patterns of the background music).

FIG. 10 is a diagram of a mixing system 1000 according to an example embodiment. Generally, a volume subsystem 1002 standardizes the volume (also referred to as loudness normalization) of an audio file 1002-2 so that the volume of the audio file 1002-2 is the same across a collection of other recordings. After the audio file 1002-2 is converted to the appropriate format by channel converter 1002-4, e.g., to a single channel Waveform Audio File Format (WAV) file, a loudness units relative to full scale (LUFS) measurement is taken by a LUFS meter 1002-6. A gain level controller 1002-8 (“LUFS Level”) adjusts the gain. For example, gain level controller 1002-8 reduces the gain if the audio file 1002-2 is too loud. If, on the other hand, the level is too soft, the peak level of the audio file 1002-2 is measured by LUFS meter 1002-6 to determine whether the gain can be raised by gain level controller 1002-8 without causing distortion. If the track is breaching a distortion threshold, then the file is compressed or limited as needed by gain level controller 1002-8.
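
The following sketch illustrates one way such a loudness-normalization stage could look, assuming the third-party pyloudnorm and soundfile packages; the -14.0 LUFS target and -1.0 dB peak ceiling are illustrative values, not parameters taken from the text.

```python
import numpy as np
import soundfile as sf
import pyloudnorm as pyln

def normalize_loudness(path: str, target_lufs: float = -14.0) -> np.ndarray:
    audio, rate = sf.read(path)                  # channel-converted WAV input
    meter = pyln.Meter(rate)                     # BS.1770 LUFS meter
    loudness = meter.integrated_loudness(audio)  # measure current loudness
    gained = pyln.normalize.loudness(audio, loudness, target_lufs)
    # Guard against distortion: if raising the gain pushed peaks past full
    # scale, limit the peaks back down instead.
    if np.max(np.abs(gained)) > 1.0:
        gained = pyln.normalize.peak(gained, -1.0)
    return gained
```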

Generally, a voice processor subsystem 1004 processes a voice file 1004-2. Initially, the format of voice file 1004-2 is normalized to a standard sample rate and bit depth wave file, based on a predetermined voice format stored in voiceFormat store 1004-5, by format normalizer 1004-6. The volume is then normalized by a volume normalizer 1004-10 by using a measurement of the LUFS of the voice file obtained from voiceLufs store 1004-9, and raising or lowering the peaks (i.e., normalizing volume). The resulting, normalized voice file is then processed by a plosives detector 1004-12 to identify when plosives occur. Plosives are a bassy, often distorted sound that results when an air blast from the mouth goes into a microphone. The most common source is the letter P, which is why plosives are sometimes generically referred to as P-Pops. While the P sound is the most common sound that causes a plosive, there are plenty of other sounds that cause similar problems, such as the letter B.

Plosives are detected by measuring the energy level of the voice file within predetermined low or high frequency ranges. If energy exists in the low or high frequency ranges in a particular distribution that exceeds a predetermined threshold, the regions in which such plosives are detected are filtered out, thereby substantially eliminating unwanted plosives. In one embodiment, the high pass filter 1004-14 (or first high pass filter 1004-14) only high pass filters the regions in which plosives have been detected. Another high pass filter 1004-16 (or second high pass filter 1004-16) is used to reduce any low frequency hum that might be in the recording. In one embodiment, the parameter of the second high pass filter 1004-16 is set based on a fundamental frequency of a voice indicating the gender of the speaker. A voice gender parameter that indicates the gender of the speaker can be preset in memory, such as voiceGender store 1004-19. Alternatively, the pitch of the voice overall is estimated and an appropriate parameter is set. That way a label (e.g., gender) is unnecessary; the parameter stores (i.e., represents) a broader fundamental voice frequency.
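
Below is a minimal sketch of this frame-by-frame plosive handling, assuming illustrative values for the frame size, cutoff frequency and energy-ratio threshold; the detection heuristic here (low-band energy relative to total frame energy) is one plausible reading of the description.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def remove_plosives(voice: np.ndarray, rate: int, cutoff_hz: float = 120.0,
                    frame: int = 1024, ratio_threshold: float = 0.5) -> np.ndarray:
    """High-pass filter only the frames whose low-band energy marks a likely plosive."""
    hp = butter(4, cutoff_hz, btype="highpass", fs=rate, output="sos")
    lp = butter(4, cutoff_hz, btype="lowpass", fs=rate, output="sos")
    low = sosfilt(lp, voice)               # energy content below the cutoff
    out = voice.copy()
    for start in range(0, len(voice) - frame + 1, frame):
        seg = slice(start, start + frame)
        total = np.mean(voice[seg] ** 2) + 1e-12
        if np.mean(low[seg] ** 2) / total > ratio_threshold:   # plosive detected
            out[seg] = sosfilt(hp, voice[seg])                 # filter only this region
    return out
```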

Next, the normalized and filtered voice file is processed by a dynamic equalizer 1004-18. Dynamic equalizer 1004-18 contains a statistical model that has been pretrained by obtaining the statistics of the discrete spectrum of plural, preferably well-mastered, voices. The statistics include, for example, a mean and variance of the discrete spectrum. If any part of a spectrum of a new recording is outside of, for example, one standard deviation of the mean of the set of recordings, then that part of the spectrum (e.g., a particular frequency band) is adjusted. In other words, the spectrum is readjusted so that it falls within the statistical range of the voices that are known to be properly recorded.
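
A small sketch of the band-adjustment rule just described; the per-band spectra are assumed to be in decibels, and reference_mean/reference_std stand in for the pretrained statistics of well-mastered voices.

```python
import numpy as np

def dynamic_eq_correction(band_db: np.ndarray, reference_mean: np.ndarray,
                          reference_std: np.ndarray) -> np.ndarray:
    """Per-band dB correction that pulls outlying bands back inside one std of the mean."""
    upper = reference_mean + reference_std
    lower = reference_mean - reference_std
    target = np.clip(band_db, lower, upper)   # bands already in range are untouched
    return target - band_db                   # correction to apply per frequency band
```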

A single band compressor 1004-20 controls erratic volume changes (e.g., unequal and/or uneven audio volume levels) based on a voice fundamental frequency. In one embodiment, the voice fundamental frequency can be a set parameter.

A multiband compressor 1004-22 detects and adjusts any variance in each frequency band. In some embodiments, the multiband compressor 1004-22 divides the frequency spectrum into different sections, or bands, so that each has its own unique compression settings to mimic a good voice recording. In one embodiment, multiband compressor 1004-22 looks at the variance of each frequency band and adjusts the variance in the voice recording to be similar to a target (e.g., defined by a parameter in voiceGender store 1004-19 that is based on a database of voice recordings that are well mastered). For example, the mean and variance over a set of good recordings is determined. If a particular section is heavily compressed, it will have a low variance in a particular frequency band. Using the statistics of the variance in each frequency band, the multiband compressor 1004-22 runs on a particular frequency range and compares the variability against the well-mastered voice recordings. Depending on the audio file, a particular frequency band might be compressed or expanded to make it match the well-mastered voice recordings defined in voiceGender store 1004-19.

A silence removal component 1004-24 removes any silence at the start or end of the voice file.

A pad silence component 1004-26 pads the voice file with silence at the start or end of the voice file so that the voice file fits within a desired start time stored in voiceStart store 1004-25 (e.g., ½ second) and duration stored in adDuration store 1004-27 (e.g., 30 seconds). The result is a processed voiceover file 1004-32 that is stored in, for example, a voiceover store.
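
A minimal sketch of this padding step, assuming mono audio as a NumPy array; the 0.5-second start and 30-second duration mirror the examples in the text.

```python
import numpy as np

def pad_voice(voice: np.ndarray, rate: int, voice_start: float = 0.5,
              ad_duration: float = 30.0) -> np.ndarray:
    """Place the voice at the desired start time and pad out to the ad duration."""
    total = int(ad_duration * rate)
    padded = np.concatenate([np.zeros(int(voice_start * rate)), voice])
    if len(padded) < total:
        padded = np.concatenate([padded, np.zeros(total - len(padded))])
    return padded[:total]   # never exceed the ad duration
```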

In some embodiments, the voiceover is further processed to determine where in the file voice is present. This information is stored in a voice activations store 1006-15, described in more detail below.

Music processor 1006 processes a music file 1006-2. Initially, a format normalizer component 1006-6 normalizes the format of music file 1006-2 to a standard sample rate and bit depth wave file based on a predetermined music format stored in musicFormat store 1006-5. The volume is then normalized by a volume normalizer 1006-10 by using a measurement of the LUFS of the music file obtained from voiceLufs store 1006-9, and raising or lowering the peaks (i.e., normalizing volume). A trimming component 1006-12 trims the music file according to a predetermined duration (e.g., ad duration) stored in adDuration store 1006-7. Alternatively, trimming component 1006-12 trims the music file by an amount received through an interface, such as input interface 202 of FIG. 2. This input can be received via client device 106 or an external system 114.

In some embodiments, the amount of the music file 1006-2 that is trimmed is determined based on selected acoustic feature(s). For example, if the desired acoustic features for an advertisement are a guitar solo without a singing voice, an acoustic-feature search component (not shown) is used to detect such acoustic features from one or more music files stored in a music file database. If the desired acoustic features are located in a music file, then that music file is used as music file 1006-2, and the section of music file 1006-2 containing the guitar solo and no singing is extracted, that section determining the amount of trimming (also referred to as the trimming parameters). Conventional or future-developed methods of detecting such acoustic features can be used.

To increase the LUFS without changing the sound and balance of the mix, a gain plugin can be inserted at the start of the chain. Compression, limiting or harmonic distortion can also be added to increase the loudness.

A single band compressor 1006-14 controls erratic volume changes.

The frequencies of voice that make the voice intelligible tend to be in the higher frequency range that humans can hear. Depending on the music in the music file 1006-2, the music may clash with a voice, such as the voice in processed voiceover file 1004-32. To cause the voice in processed voiceover file 1004-32 to be more intelligible, a multiband compressor 1006-16 in music processor 1006 is used to compress the top (i.e., predetermined) frequency range of the music file 1006-2 (e.g., 6,000-20,000 Hz), such that it is active when a voice is speaking. By doing so, multiband compressor 1006-16 creates a space in a high range that permits the processed voice in processed voiceover file 1004-32 to be more clearly understood. Voice activations store 1006-15 stores a curve corresponding to the voice in processed voiceover file 1004-32. In some example embodiments the curve corresponds to when a voice is present in processed voiceover file 1004-32.

In one example embodiment, the curve is determined by measuring the energy in the audio signals that make up the processed voiceover file 1004-32. A low-pass filter filters the squared energy of the audio signals. A logarithmic compressor compresses the filtered, squared energy to force the range close to 0-1, and anything above a predetermined threshold (e.g., 0.5) is considered to be active (i.e., voice is present).
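
The sketch below follows that recipe (square, low-pass filter, log-compress, threshold at 0.5); the filter order and 10 Hz cutoff are illustrative assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def voice_activation(voice: np.ndarray, rate: int,
                     cutoff_hz: float = 10.0, threshold: float = 0.5) -> np.ndarray:
    """1.0 where voice is considered present, 0.0 elsewhere."""
    sos = butter(2, cutoff_hz, btype="lowpass", fs=rate, output="sos")
    envelope = sosfilt(sos, voice ** 2)                  # smoothed squared energy
    compressed = np.log1p(np.maximum(envelope, 0.0))     # log compression toward 0-1
    compressed /= np.max(compressed) + 1e-12
    return (compressed > threshold).astype(float)
```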

A fade controller 1006-18 performs fade-in and fade-out of the processed music file within a predetermined timeline. The result is a processed music file 1006-20 that is stored in a store such as media object store 116 or creative store 118. For convenience, a store that stores a music file is referred to as a music store.

A mixing processor subsystem 1008 receives the processed voiceover file 1004-32 and the processed music file 1006-20 and further processes them to set a target weighting between the loudness of the processed music in processed music file 1006-20 and the voice in processed voiceover file 1004-32. This is possible because both the voice and music have been normalized to a specific loudness volume as described above. A predetermined background volume parameter stored in background volume store 1008-7 indicates the amount of relative volume between the normalized voice and music files (e.g., the background volume is 30% of the loudness of the voiceover volume). Weighted sum component 1008-6 adjusts the volume of processed voiceover file 1004-32 and processed music file 1006-20 according to the background volume parameter and adds them together. Single band compressor 1008-8 in mixing processor subsystem 1008, in turn, flattens the volume out to ensure that the combination of the content of the processed voiceover file 1004-32 and processed music file 1006-20 is uniform. The output file 1008-10 is the result of the process performed by mixing processor subsystem 1008.
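
A minimal sketch of the weighted-sum stage, assuming both inputs are already loudness-normalized mono arrays; the peak guard at the end merely stands in for the single band compressor.

```python
import numpy as np

def weighted_mix(voice: np.ndarray, music: np.ndarray,
                 background_volume: float = 0.3) -> np.ndarray:
    """Weighted sum of loudness-normalized voice and music (music at 30% here)."""
    n = min(len(voice), len(music))
    mix = voice[:n] + background_volume * music[:n]
    peak = np.max(np.abs(mix))
    return mix / peak if peak > 1.0 else mix   # crude stand-in for the compressor stage
```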

Voiceover processor subsystem 1010 determines the voiceover start time (voiceStart), which is stored in voiceover start time store 1010-6. The voiceover start time, voiceStart, stored in voiceover start time store 1010-6 can be predetermined (e.g., 0.5 seconds).

Depending on the length of the voiceover in processed voiceover file 1004-32, it may be desirable to shift the start time of the voiceover within the processed music in processed music file 1006-20. For example, it may be desirable to start the voiceover right away or, alternatively, at the end of the first measure or beat of the music. Voiceover timing 1010-10 shows an example voiceover start and an example voiceover end. In some embodiments, the voiceover start type is determined by performing an analysis of the music file. In one example embodiment, the energy of the background music is measured and the voiceover start time is chosen according to whether the energy meets a predetermined tolerance. In some example embodiments, a beat detector (not shown) executing a beat detection algorithm can be used to determine the timing of the processed music file (e.g., the beats of the music). The start time can then be determined based on which beat the voiceover should start on (e.g., the first beat).
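
One way the beat-based start time could be estimated, assuming the third-party librosa package for beat tracking; starting on the first detected beat, and the 0.5-second fallback, are illustrative policies rather than values taken from the text.

```python
import librosa

def voiceover_start_time(music_path: str, beat_index: int = 0,
                         default: float = 0.5) -> float:
    """Start the voiceover on a chosen beat of the background music."""
    y, sr = librosa.load(music_path, sr=None)
    _, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
    beat_times = librosa.frames_to_time(beat_frames, sr=sr)
    if beat_index < len(beat_times):
        return float(beat_times[beat_index])
    return default   # fall back to the predetermined start time
```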

Asynchronous Execution

Before becoming a targeted media content file, the corresponding input data is processed by the various components of mixing described above. As shown above with respect to FIG. 10, each task may have a respective flow, and the different sequential steps of the respective flow need to be performed on the input data. In some embodiments, for example, the volume subsystem 1002, the voice processor subsystem 1004, the music processor 1006, and the mixing processor subsystem 1008 are performed asynchronously.

Example aspects provide a definition of the workflow and workers that perform the various steps within the workflow. These aspects provide recovery mechanisms, retry mechanisms, and notification mechanisms.

In some embodiments, at least a portion of the steps performed by the various functions can be performed asynchronously. As such, one function flow is not waiting for the result of another function flow. Once a series of steps is initiated, those steps are performed in the background by so-called workers. A view of the output (i.e., a view of a media object) is returned via an interface. Optionally, a view of the output is returned via an interface at each step. If necessary, a notification is issued (e.g., via an interface) requesting additional input. The individual flows are performed asynchronously, while responses back through, for example, the interface are synchronous.

The example embodiments execute a number of flows depending on input. For example, various types of input can be received through the interface. Depending on the type of input, a different workflow is performed. For example, if a media content file or a location of a media content file (e.g., a background track) is input, one workflow is performed. If no such input is received, then another workflow is performed, for example, one which either requests or otherwise obtains a different type of input.

In an example embodiment, logic determines, based on some combination of inputs, a particular flow that should be implemented. Each flow returns a result (e.g., a return value such as a Boolean value). If each step is successful and each worker returns a success message, the manager for the entire flow or pipeline knows to step the media object (e.g., an audio advertisement to be transmitted) to its next successful state based on the workflow definition. If a failure occurs during the flow, the manager knows how to handle the failure or retry a sequence of steps based on the workflow or pipeline definition.

In an example embodiment, each independent routine, e.g., waiting for a voiceover, generating a new voiceover project, mixing, and trafficking, is a worker in the pipeline manager. Every worker has a defined logic that it performs. A mixing worker, for example, calls scripts that perform certain functionality. If the mixing worker performs the scripts successfully, the mixing worker causes a mixed media object (e.g., audio advertisement) to be stored in memory so that it can, in turn, be accessed for other steps, and returns a message indicating that it executed its flow successfully. If, for example, the mixing worker performs a script that fails, then the mixing worker returns a message or value indicating that it has failed.

Every worker also has its own definition for what is successful. In the case of a mixing worker, for example, if an internal process in the mixing stage has determined that an internal stage has failed (e.g., a voiceover is silent, indicating that the voiceover mixing has failed), then the mixing worker returns a message indicating that the mixing stage has failed. Every worker has its own definition of what is successful and what is a failure.
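
A hedged sketch of that per-worker success contract follows; the silence check stands in for the "silent voiceover" failure condition named above, and the threshold and message shapes are assumptions.

```python
import numpy as np

class MixingWorker:
    """Each worker runs its own logic and reports success or failure by its own definition."""

    def run(self, voice: np.ndarray, music: np.ndarray) -> dict:
        if np.max(np.abs(voice)) < 1e-4:
            # This worker's own failure condition: a silent voiceover means the mix failed.
            return {"status": "failure", "reason": "voiceover is silent"}
        n = min(len(voice), len(music))
        mix = voice[:n] + 0.3 * music[:n]   # stand-in for the real mixing scripts
        return {"status": "success", "result": mix}
```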

Example embodiments described herein can also provide automated routing and failure (e.g., retries) and recovery handling (e.g., fallback). In addition, the embodiments allow the various functions to be modular and for different workflows to be defined. If one worker fails, the logic for how it falls back is dependent on the type of failure. Each worker can thus be performed more than one time safely.

In an exemplary embodiment, the individual components may not be part of a sequential workflow. In other words, they do not know that they are part of a flow at all; they just know that they might be called. This allows the manager to be untethered to any particular workflow.

The pipeline manager is given all of the workers and workflow definitions. The pipeline manager, using the workflow definitions, executes the workers in sequence and manages predefined successes and failures.

FIG. 7 is a diagram illustrating a system for automating the generation of a creative in accordance with an example embodiment of the present invention. A service 701 contains a workflow definition store 702 and a pipeline manager 704. A worker store 708 containing workers 710₁, 710₂, 710₃, . . . , 710_(n) (e.g., Worker₁, Worker₂, Worker₃, . . . , Worker_(n)) (each individually and collectively 710) resides independently from service 701. A message queue 706 that performs routing is communicatively coupled to the service 701 and the worker store 708. Commands (CMDs) are communicated by the message queue 706 to the workers 710 to instruct the workers 710 to perform predetermined tasks. In return, the workers 710 communicate back to the pipeline manager 704 via message queue 706 a message indicating whether the task they performed was a success or failure (S/F). In turn, the pipeline manager 704 determines the next step based on a workflow definition stored in workflow definition store 702. In one example embodiment, the pipeline manager 704 does not hold the logic, but rather communicates through the message queue 706 to instruct the workers to perform tasks. In this embodiment at least one custom workflow definition is used. In addition, asynchronous execution via the message queue is performed.
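
The sketch below illustrates the pattern, with Python's built-in queue standing in for the real message queue; the step names and workflow definition are hypothetical.

```python
import queue

# Hypothetical workflow definition: maps each step to the next step on
# success or failure, mirroring the S/F messages of FIG. 7.
workflow_definition = {
    "wait_for_voiceover": {"success": "mix", "failure": "new_project"},
    "mix":                {"success": "traffic", "failure": "log_failure"},
}

def run_pipeline(workers: dict, start: str = "wait_for_voiceover") -> str:
    messages = queue.Queue()
    step = start
    while step in workers:
        messages.put(("CMD", step))       # command routed toward a worker
        _, task = messages.get()
        outcome = workers[task]()         # worker answers "success" or "failure"
        step = workflow_definition.get(task, {}).get(outcome, "")
    return step                           # first step with no registered worker

# Example: two trivial workers; "traffic" is terminal here because no worker handles it.
print(run_pipeline({"wait_for_voiceover": lambda: "success",
                    "mix": lambda: "success"}))   # -> traffic
```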

In an example embodiment, at least a portion of the metadata used to generate a creative is stored in a database prior to the creative generation process. Metadata includes assets that are available to each worker. There can be exceptions. For example, additional information can be added by a workflow. As part of the idempotent nature of the workers, for example, if one encounters a field that requires input and the information is not necessary, the worker will bypass (e.g., ignore) that missing field. Thus, with exceptions, metadata is available at the start of the creative process.

FIG. 8 is a diagram illustrating a system for automating the generation of a creative in accordance with an example embodiment of the present invention. Referring to FIG. 8, assets can be stored in asset database 812 and made available to the workers 806₁, 806₂, 806₃, . . . , 806_(n) (e.g., Worker₁, Worker₂, Worker₃, . . . , Worker_(n)) (each individually and collectively 806). In addition, predefined component identifiers can be prestored in an object store 808. Asset database 812 (also referred to as asset store 812) can be configured to have plural buckets that store media objects. A workflow definition 810 is called to execute a task.

In an example implementation, a mixing workflow mixes a component identifier that has been predefined and stored in object store 808 with a media object stored in asset database 812 and made available to each worker 806 (e.g., Worker₁, Worker₂, Worker₃, . . . , Worker_(n)) in case a worker needs to use it. For example, if a worker is in charge of mixing an audio component identifier stored in object store 808 with a media object, the mixing workflow can mix the audio component identifier and the media object, store the mix in asset database 812 (e.g., in a bucket), and make the mix of the media object and the component identifier available to the workers.

In one embodiment, a failure mode causes creative development platform 200 to repeat workflows. This is accomplished by making each workflow idempotent. An idempotent workflow is a workflow that produces the same results whether executed once or multiple times. This configuration avoids the need to undo any of the work that has already been done by the workflows in the event of a failure. In other words, an operation can be repeated or retried as often as necessary without causing unintended effects, while avoiding the need to keep track of whether the operation was already performed or not.
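
A minimal sketch of idempotency in a worker: check whether the output already exists before doing the work, so a retry after a failure is harmless. The dict-backed store and key scheme are purely illustrative.

```python
def idempotent_mix(asset_store: dict, key: str, voice, music):
    """Produces the same stored result whether run once or retried many times."""
    if key in asset_store:                       # work already done: repeating is safe
        return asset_store[key]
    asset_store[key] = ("mixed", voice, music)   # stand-in for the actual mixing work
    return asset_store[key]
```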

A workflow definition 810 can be performed more than one time until the correct results are achieved. An attempt can be made, for example, to perform a workflow definition 810 that traffics a creative more than one time without actually releasing the creative. Similarly, an attempt to perform a workflow that calculates or communicates billing information can be performed more than one time. In yet another aspect, an attempt to perform a workflow that mixes audio can be performed more than one time.

The example pipeline flow definition code can be stored in memory. The pipeline manager has a pool of threads that are available to perform work and available internally. The pipeline manager manages execution of plural threads that communicate messages to a corresponding worker. The worker returns a result. Based on the result, the manager references the applicable workflow definition, chooses the next step and passes the work to the next worker via another thread. In an example embodiment, this is accomplished by placing messages onto the message queue. The system is thus asynchronous. The message queue allows the system to be scalable and distributable. Thus, several systems of workers can be created independently, thereby eliminating the need to limit the workers to a predetermined number of threads (e.g., an initiate command that initiates the generation of a creative, a boost command that causes creatives associated with a predetermined object to be generated).

Personalized Creatives with Call to Action

FIG. 11 illustrates a dynamic call to action process 1100 in accordance with an example embodiment. Generally, dynamic call to action process 1100 involves generating scripts that provide information and/or calls for action. In the case where the scripts are calls for action, the dynamic call to action process causes a device to expect input through its input interface. The input and calls for action are generated according to information associated with a promoted entity (e.g., datapoint values received from a promoter via external system 114, datapoint values associated with a user (e.g., received from a service storing data associated with the user), and datapoint values associated with a device operated by the user (e.g., device 106)). A promoted entity is an entity that is the subject of advertising or promotion, where advertising generally refers to controlled messages in the media, while promotion includes marketing activities, such as sales or sponsorships. Example promoted entities include a brand, a business, an organization, a product, a place, a concert, media content (audio content, video content, image content, games, podcasts, books, etc.), and the like.

As shown in the legend of FIG. 11, dynamic call to action process 1100 includes (1) taking an action, (2) checking for possible outcomes, (3) supplying script elements that are definite (referred to as definite script elements), (4) supplying script elements that are possible (referred to as possible script elements), (5) taking user context or preferences as input and (6) taking a promoted entity's metadata as input.

The example implementation depicted in FIG. 11 relates to an objective involving a concert promotion. The script elements (e.g., definite script elements and possible script elements) are text that is presented (e.g., played back) through a device 106 using text to speech processing. In an example embodiment, at least one or more of the script elements are mixed with other audio files (e.g., background music) using the mixing system 1000 described above in connection with FIG. 10, by storing the text to voice generated during the execution of process 1100 as a file. An audio file (e.g., background music) can be obtained as described above in connection with FIG. 9.

In some embodiments, the script elements can be streamed. Thus, instead of storing them as a file (e.g., a voiceover file), the script elements can be retrieved in real time.

Definite script elements can be fixed or variable. A definite script element that is fixed is referred to as a fixed definite script element. A definite script element that is variable is referred to as a variable definite script element. A fixed definite script element is a script element that is in every script for a particular campaign objective (e.g., as selected by a promoter using interface 300A discussed above in connection with FIG. 3A). Example fixed definite script elements are depicted in FIG. 11 according to the legend “Script Element: Definite” and, as shown in FIG. 11, have a term or phrase within quotes.

For example, a fixed definite script element for an ad object corresponding to a concert will always include the phrase “Concert Tickets” (block 1102), the term “For” (block 1112), the term “At” (block 1124), and the term “In” (block 1128). Other fixed definite script elements include punctuation such as a comma “,” (block 1132, block 1136), a period (block 1142), a question mark “?” (not shown), and the like, which when converted to speech cause the speech synthesizer to pause, or to present (e.g., play back) the speech with inflection or emphasis points, tone, or other information.

A variable definite script element is a script element that is included in a script and includes content that can vary. Thus, like a fixed definite script element, a variable definite script element is a script element that is always played for a particular campaign objective (e.g., as selected by a promoter using interface 300A discussed above in connection with FIG. 3A), but the value of the variable definite script element will change. For example, a variable definite script element for an ad object corresponding to a concert can be set to always include the name of the main artist (block 1122), the name of the venue closest to a user (block 1126), the name of the city in which the venue resides (block 1130), and the day of the week (block 1134), among others, but those script elements will always vary. Example variable definite script elements are depicted in FIG. 11 according to the legend as “Script Element: Definite” and, as shown in FIG. 11, have an attribute of the script element (i.e., the value) within quotes and brackets.
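
For illustration, here is a sketch of how the definite script elements above could be assembled into a script string; the helper name and argument set are hypothetical, and the values would come from the promoted entity's metadata.

```python
def build_concert_script(artist: str, venue: str, city: str,
                         day: str, month: str, year: str) -> str:
    """Assemble fixed definite elements and metadata-filled variable elements."""
    elements = [
        "Concert Tickets",              # fixed definite (block 1102)
        "For", artist,                  # blocks 1112, 1122
        "At", venue,                    # blocks 1124, 1126
        "In", city + ",",               # blocks 1128, 1130, 1132
        day + ",", month, year + ".",   # blocks 1134-1142
    ]
    return " ".join(elements)

print(build_concert_script("Artist X", "Venue Y", "City Z",
                           "Friday", "June", "2018"))
```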

As explained above, a fixed definite script element is used in all scripts generated for a particular type of objective (e.g., a concert will always include the phrase “Concert tickets” as shown in block 1102). Such fixed definite script elements can be prestored in a memory store. Optionally, such fixed definite script elements can be prestored in a memory store that is relatively faster than memory stores that store other data (e.g., variable definite script elements), to increase the speed at which fixed definite script elements can be accessed.

Possible script elements also can be fixed or variable. A possible script element that is fixed is referred to as a fixed possible script element. A possible script element that is variable is referred to as a variable possible script element. Unlike definite script elements, a possible script element is selected based on one or more factors and is not necessarily included in an advertisement creative. In some embodiments, factors that determine whether a possible script element is used include information related to the end user (e.g., user context or user preferences). In some embodiments, factors that determine whether a possible script element is used include information related to the ad campaign. In some embodiments, factors that determine whether a possible script element is used include information related to the device that will receive the ad creative.

In some embodiments, there exist multiple options for either a definite script element or a possible script element. A definite script element that is selectable is referred to as a selectable definite script element. For a given situation, a selection of one selectable definite script element is made. Depending on when the relative campaign start date is, for example, one of multiple selectable definite script elements can be selected (e.g., selectable (fixed) definite script elements 1106, 1108 or 1110). Thus, if a definite script element is one of several possible definite script elements, then it is referred to as a selectable definite script element (e.g., a first selectable definite script element, a second selectable definite script element, and so on).

In some embodiments, selectable definite script elements can be fixed or variable. A selectable definite script element that is fixed is referred to as a selectable fixed definite script element. A selectable definite script element that is variable is referred to as a selectable variable definite script element. Example fixed definite script elements that are selectable (i.e., selectable fixed definite script elements) are depicted in FIG. 11 according to the legend “Script Element: Definite”, where the selectable fixed definite script elements follow a procedure that checks for possible outcomes and causes the process to select a selectable fixed definite script element based on the outcome.

Example procedures that check for possible outcomes include a decision function and a data retrieval function. An example data retrieval function is shown in FIG. 11 as data retrieval function 1104. Data retrieval function 1104 particularly retrieves data corresponding to when, relative to a particular promotion, the ad campaign is being made. The timing of the promotion dictates which selectable definite script element is selected.

Although not shown in FIG. 11, a selectable variable definite script element would be depicted according to the legend as “Script Element: Definite”, where the variable definite script elements that are selectable (i.e., the selectable variable definite script elements) follow a procedure that checks for possible outcomes and causes the process to select a selectable variable definite script element based on the outcome.

In some embodiments, the process performs a check (also referred to as a determination). A corresponding script element is obtained based on the check.

In turn, predetermined criteria can be selected based on the information retrieved from the checking. As shown in FIG. 11, in some embodiments, predetermined criteria 1105 are selected based on the possible outcomes obtained from data retrieval function 1104. In some embodiments, predetermined criteria 1105 can be a threshold based on time, referred to for simplicity as a time threshold. In some embodiments, predetermined criteria 1105 can be an inventory value, referred to simply as inventory criteria. If a first predetermined criterion has been met, then a first definite script element is selected. If a second predetermined criterion has been met, then a second definite script element is selected. If a third predetermined criterion has been met, then a third definite script element is selected. And so on. The concepts of fixed and variable have been omitted here for ease of understanding. Example predetermined criteria 1105 include a time threshold that can be in units of days, hours, minutes, and the like. Example predetermined criteria 1105 can include an inventory value, such as an inventory of tickets.

The particular example shown in FIG. 11 involves the sale of tickets for a concert ticket sales campaign. If the creative for the concert ticket sales campaign is for tickets that will be on sale in x days, then a first selectable fixed definite script element 1106 (e.g., “Will be on sale soon”) is selected. If the creative for the concert ticket sales campaign is for tickets that are now on sale (e.g., after x minutes from being on sale), then a second selectable fixed definite script element 1108 (e.g., “Are now on Sale”) is selected. If the creative for the concert ticket sales campaign is for tickets that are now on sale (e.g., after x minutes from being on sale) and there are y tickets left, then a third selectable fixed definite script element 1110 is selected. As described above, x is in units related to time (e.g., days, hours, minutes, etc.) and y is an integer.
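
A hedged sketch of that selection logic follows; the wording of the third element (block 1110) is not given in the text and is assumed here, as are the parameter names.

```python
from typing import Optional

def select_onsale_element(minutes_since_onsale: float,
                          tickets_left: Optional[int],
                          y: int = 100) -> str:
    """Pick one of the selectable fixed definite script elements 1106/1108/1110."""
    if minutes_since_onsale < 0:                  # on-sale date still in the future
        return "Will be on sale soon"             # block 1106
    if tickets_left is not None and tickets_left <= y:
        return "Are now on sale, only %d left" % tickets_left  # block 1110 (wording assumed)
    return "Are now on Sale"                      # block 1108
```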

As explained above, it should be understood that the selectable definite script elements can be variable. For example, instead of being selectable fixed definite script elements 1106, 1108 and/or 1110, script elements 1106, 1108 and/or 1110 can include fillable fields, where the fillable fields are filled with data obtained from a database. The data that is used to fill the fields can vary based on the outcome of the check. Were this the case, script elements 1106, 1108 and 1110 would be selectable variable definite script elements.

Process 1100 can proceed based on the results of a check for possible outcomes. For example, as shown in block 1114, a determination is made as to whether there exist multiple artists related to a concert ad campaign. The determination as to whether there are multiple artists related to the concert ad campaign can be based on metadata obtained from the promoted entity.

In the example shown in FIG. 11, if a determination is made at block 1114 that there do exist multiple artists associated with the concert ad campaign, then a query 1116 can be sent to a processor. In the example shown in FIG. 11, the query 1116 is a query for the artist with the highest user affinity. The selection of the artist with the highest user affinity is performed using now known or future developed processes for selecting an artist with the highest user affinity.

In response, data that can be inserted into a variable possible script element 1118 is received. In this example, variable possible script element 1118 is followed by a fixed possible script element 1120.

In some embodiments, the variable definite script element contains input corresponding to user context. In some embodiments, the variable definite script element contains input related to user preferences. In some embodiments, the variable definite script element contains metadata related to a promoted entity. Example variable definite script element 1122 contains a name of an artist received from a metadata database storing metadata related to a promoted entity. Example variable definite script element 1126 contains a name of a venue received from a metadata database storing metadata related to a promoted entity. In some example embodiments, variable definite script element 1126 contains the name of the venue, received from a metadata database storing metadata related to a promoted entity, that is closest to the recipient of a corresponding creative.

In some example embodiments, variable definite script element 1130 contains a name of a city received from a metadata database storing metadata related to a promoted entity. In some example embodiments, an example variable definite script element contains a date of an event received from a metadata database storing metadata related to a promoted entity. For example, variable definite script element 1134 contains a day of the week, variable definite script element 1138 contains a month and variable definite script element 1140 contains a year.

In some embodiments, a check for possible outcomes includes collecting one or more information items from a device 106 and determining whether or not a condition related to the device 106 is met (e.g., true). Subsequent checks for possible outcomes are based on the determination as to whether or not the condition related to the device 106 is met.

As shown in block 1144, for example, a determination is made using a mobile device (e.g., devices 106-1, 106-2, and the like) as to whether a user is driving. If not, a determination is then made as to whether the user is in focus, as shown in block 1146. A user is in focus if the device of the user is capable of receiving a communication. The communication can be an audio communication, a visual communication, or a combination of an audio communication and visual communication. A determination as to whether a user is in focus can be performed by using the sensor components and software of a mobile device 106. In some embodiments, for example, device 106 may optionally include a motion sensor 128, such as a gyro-movement sensor or accelerometer, that is arranged to sense that device 106 is in motion and/or is being accelerated or decelerated. In some embodiments, a camera or similar optical sensor can be used to determine whether a user is looking at the device 106. Similarly, audio sensors on device 106 can detect whether a user is present by listening for sounds from the user. Both the audio and visual sensor data can be processed in conjunction with the data relating to whether the device 106 is moving, such that if a user is looking at the mobile device but driving, an appropriate script or action will follow.

If a determination is made at block 1144 that the user is driving, the call to action process 1100 ends (block 1168).

If a determination is made at block 1144 that the user is not driving and a determination is made at block 1146 that the user is in focus, then a definite script element is played, where the definite script element includes an instruction as to how the user of the device 106 should respond, as shown at block 1150. When a script element requests an action of a user via a device, such a script is referred to as a call for action script element.

In this example, the user is instructed via a call for action script element to tap the device to obtain tickets. The device is programmed to wait for a tap (e.g., a tap of a particular icon or simply a tap of the housing of the mobile device, which is detected by a vibration sensor in the mobile device, via a capacitive sensor of the mobile device, or other touch or vibration sensing component of the mobile device). If a determination has been made at block 1156 that the device has received a tap, the device 106 proceeds with taking an action. In this example, the action involves a ticketing action, as shown at block 1158. Any number of now known or future known mechanisms for effecting an action upon receipt of user input (e.g., a tap) can be taken. If a determination is made at block 1156 that a user has not tapped the device within a predetermined amount of time (e.g., 30 seconds), then the process ends.

If a determination is made at block 1146 that the user is not in focus, then a determination is made whether the device of the user is in a speakable state, as shown in block 1148. A speakable state is a state in which a user can verbalize a response via a device. If a determination is made at block 1148 that the user is in a speakable state, then a script element including an instruction instructing the user to speak a certain utterance is played through device 106, as shown in block 1152. In the example shown in FIG. 11, script element 1152 is a fixed definite script element. A script element that provides an instruction can also be referred to as an instruction script element. Instruction script elements can be any combination of definite or possible and fixed or variable.

Upon playing the script element 1152, the dynamic call to action process 1100 causes the device 106 to receive a voice utterance as shown in block 1160. In an example implementation, the device 106 receives a voice utterance by turning on the microphone of the device 106, playing a microphone-on tone, and turning on a visual listening indicator. Upon receiving an utterance via a microphone, a determination is made at block 1162 as to what the user said. This can be performed by now known or future developed natural language processing functions (e.g., voice recognition). What the user has uttered determines the next action. In the example shown there exist three types of actions: a first action, a second action and a third action. It should be understood that there could be more types of actions available.

In the example implementation illustrated in FIG. 11, if a determination has been made at block 1162 that the user said nothing for a predetermined amount of time, the process causes the device to perform a first action. In the example implementation, the first action is an action to play a microphone-off tone (block 1166) and an action to end the advertisement (block 1168). If a determination has been made at block 1162 that the user spoke an expected utterance (e.g., “Save this”), the process causes the device to perform a second action as shown in block 1164. In the example implementation, the second action is for the device to play a sound indicating that receipt of the instructions was successful, play the microphone-off tone (block 1166) and end the advertisement as shown in block 1168.

If a determination is made at block 1162 that the user uttered something else (e.g., an utterance that was not expected by the process), then the process causes the device to perform a third action. In this example, the third action is for the device to play an error tone as shown in block 1170 and then, for example, repeat a verbal script instructing the user to speak a certain utterance, as shown in block 1152. Optionally, another verbal script can be provided (not shown).

If a determination is made at block 1148 that the user is not in a speakable state, then at block 1154 the process causes a third script to be played through the device 106. In turn, the process causes the device to wait for a response, as shown in block 1172. In this example, the response that is expected is a double tap that is detected via a sensor (e.g., the accelerometer) of the device 106. If a determination is made at block 1174 that the device received the expected response (e.g., a double tap), then the process causes the device to perform a second action as shown in block 1164. In the example implementation, the second action is for the device to play a sound indicating that receipt of the instructions was successful (block 1164), play the microphone-off tone (block 1166) and end the advertisement (block 1168).

In addition to or instead of an audio sound, haptic feedback can be initiated by the device 106.

If a determination is made at block 1174 that the user did not double tap within a predetermined time, then the advertisement ends (block 1168).

FIG. 12 illustrates a dynamic call to action process 1200 in accordance with an example embodiment. Generally, dynamic call to action process 1200 involves generating scripts that provide information and/or calls for action. In the case where the scripts are calls for action, the dynamic call to action process causes a device to expect input through its input interface. The input and calls for action are generated according to information associated with a promoted entity (e.g., datapoint values received from a promoter via external system 114, datapoint values associated with a user (e.g., received from a service storing data associated with the user), and datapoint values associated with a device operated by the user (e.g., device 106)).

As shown in the legend of FIG. 12, dynamic call to action process 1200 includes (1) taking an action, (2) checking for possible outcomes, (3) supplying script elements that are definite (referred to as definite script elements), (4) supplying script elements that are possible (referred to as possible script elements), (5) taking user context or preferences as input and (6) taking a promoted entity's metadata as input.

The example implementation depicted in FIG. 12 relates to an advertisement campaign involving a podcast promotion. The script elements (e.g., definite script elements and possible script elements) are text that is presented (e.g., played back) through a device 106 using, for example, text to speech processing. In an example embodiment, at least one or more of the script elements are mixed with other audio content (e.g., background music) using the mixing system 1000 described above in connection with FIG. 10, by storing the text to voice generated during the execution of process 1200 as a file. The audio file (e.g., background music) can be obtained as described above in connection with FIG. 9.

In some embodiments, the script elements can be streamed. Thus, instead of storing them as a file (e.g., a voiceover file), the script elements can be retrieved in real time.

Definite script elements can be fixed or variable. A definite script element that is fixed is referred to as a fixed definite script element. A definite script element that is variable is referred to as a variable definite script element. A fixed definite script element is a script element that is in every script for a particular campaign objective (e.g., as selected by a promoter using interface 300A discussed above in connection with FIG. 3A). Example fixed definite script elements are depicted in FIG. 12 according to the legend “Script Element: Definite” and as shown in FIG. 12 have a term or phrase within quotes.

For example, a fixed definite script element for an ad object corresponding to a podcast will always include the term “Episode” (block 1224), and the phrase “Is now out on Spotify” (block 1228). Other fixed definite script elements can include punctuation such as a comma “,”, a period “.”, a question mark “?”, and the like, which when converted to speech cause the speech synthesizer to pause or to create inflection or emphasis points, tone, or other information.

A variable definite script element is a script element that is included in a script and includes content that can vary. Thus, like a fixed definite script element, a variable definite script element is a script element that is always played for a particular campaign objective (e.g., as selected by a promoter using interface 300A discussed above in connection with FIG. 3A), but the value of the variable definite script element will change. For example, a variable definite script element for an ad object corresponding to a podcast can be set to always include an episode number (block 1226), but the episode number itself may vary. Example variable definite script elements are depicted in FIG. 12 according to the legend as “Script Element: Definite” and as shown in FIG. 12 have an attribute of the script element (i.e., the value) within quotes and brackets.

As explained above, a fixed definite script element is used in all scripts generated for a particular type of campaign (e.g., a podcast will always include the phrase “Episode” as shown in block 1224 and “is now out on Spotify” (block 1228)). Such fixed definite script elements can be prestored in a memory store. Optionally, such fixed definite script elements can be prestored in a memory store that is relatively faster than memory stores that store other data (e.g., variable definite script elements) to increase the speed at which fixed definite script elements can be accessed.

Possible script elements also can be fixed or variable. A possible script element that is fixed is referred to as a fixed possible script element. A possible script element that is variable is referred to as a variable possible script element. Unlike definite script elements, a possible script element is selected based on one or more factors and is not necessarily included in an advertisement creative. In some embodiments, factors that determine whether a possible script element is used include information related to the end user (e.g., user context or user preferences). In some embodiments, factors that determine whether a possible script element is used include information related to the ad campaign. In some embodiments, factors that determine whether a possible script element is used include information related to the device that will receive the ad creative.

In some embodiments, there exist multiple options for either a definite script element or a possible script element. Such a definite script element is referred to as a selectable definite script element. For a given situation, a selection of one of the selectable definite script elements is made. Depending on, for example, the campaign start date relative to the current date, one of multiple selectable definite script elements can be selected (e.g., selectable definite script elements 1206 or 1208, which, as explained below, in this example are fixed). Thus, if a definite script element is one of several possible definite script elements, then it is referred to as a selectable definite script element.

In some embodiments, selectable definite script elements can be fixed or variable. A selectable definite script element that is fixed is referred to as a selectable fixed definite script element. A selectable definite script element that is variable is referred to as a selectable variable definite script element. Example fixed definite script elements that are selectable (i.e., selectable fixed definite script elements) are depicted in FIG. 12 according to the legend “Script Element: Definite” and where the selectable fixed definite script elements follow a procedure that checks for possible outcomes and causes the process to select a selectable fixed definite script element based on the outcome.
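
The taxonomy of script elements described above can be illustrated with a minimal data model. The following Python sketch is illustrative only; the class and field names are assumptions, not part of the described system:

    from dataclasses import dataclass
    from enum import Enum
    from typing import Callable, List, Optional

    class Inclusion(Enum):
        DEFINITE = "definite"   # always present for the campaign objective
        POSSIBLE = "possible"   # included only when its condition holds

    @dataclass
    class ScriptElement:
        inclusion: Inclusion
        fixed_text: Optional[str] = None                    # fixed elements, e.g. "Episode"
        value_fn: Optional[Callable[[dict], str]] = None    # variable elements
        condition: Optional[Callable[[dict], bool]] = None  # gating check
        options: Optional[List["ScriptElement"]] = None     # selectable elements

        def render(self, context: dict) -> Optional[str]:
            # Skip a possible element whose condition is not met.
            if self.condition is not None and not self.condition(context):
                return None
            # A selectable element delegates to the first option whose check passes.
            if self.options:
                for option in self.options:
                    text = option.render(context)
                    if text is not None:
                        return text
                return None
            if self.fixed_text is not None:
                return self.fixed_text     # fixed element
            return self.value_fn(context)  # variable element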

Example procedures that check for possible outcomes include a decision function and a data retrieval function. An example data retrieval function is shown in FIG. 12 as data retrieval function 1202. Data retrieval function 1202 particularly retrieves data corresponding to whether a user has listened to a particular podcast before. Whether the user has listened to the particular podcast before dictates which selectable definite script element is selected.

As shown in FIG. 12, a selectable variable definite script element is depicted according to the legend as “Script Element: Definite” where the variable definite script elements that are selectable (i.e., the selectable variable definite script elements) follow a procedure that checks for possible outcomes and causes the process to select a selectable variable definite script element based on the outcome. In the example implementation illustrated by FIG. 12, the selectable variable definite script elements are block 1210 (“[Name of that Similar Podcast]”) and block 1212 (“[Podcast Category]”).

Process 1200 can proceed based on the results of a check for possible outcomes. For example, as shown in block 1202, a determination is made as to whether the user has listened to a particular podcast before; in block 1204, a determination is made as to whether the user has listened to a similar podcast before; and in block 1218, a determination is made as to whether the podcast has multiple seasons.

In some embodiments the variable definite script element contains input corresponding to user context. In some embodiments the variable definite script element contains input related to user preferences. In some embodiments the variable definite script element contains metadata related to a promoted entity. Example variable definite script element 1222, for instance, contains an episode number of a podcast received from a metadata database storing metadata related to the promoted entity (e.g., the podcast).
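
Continuing the illustrative sketch above, a data retrieval function such as block 1202 can be modeled as a context lookup that gates which elements render; the datapoint names below are hypothetical:

    # Hypothetical context assembled from user data and promoted-entity metadata.
    context = {"has_listened": False, "episode_number": 42,
               "podcast_category": "comedy"}

    episode_word = ScriptElement(Inclusion.DEFINITE, fixed_text="Episode")
    episode_no = ScriptElement(Inclusion.DEFINITE,
                               value_fn=lambda c: str(c["episode_number"]))
    category = ScriptElement(
        Inclusion.DEFINITE,
        value_fn=lambda c: c["podcast_category"],
        condition=lambda c: not c["has_listened"],  # outcome of the retrieval check
    )

    parts = [e.render(context) for e in (category, episode_word, episode_no)]
    script = " ".join(p for p in parts if p is not None)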

In some embodiments a check for possible outcomes includes collecting one or more information items from a device 106 and determining whether or not a condition related to the device 106 is met (e.g., true). Subsequent checks for possible outcomes are based on the determination as to whether or not the condition related to the device 106 is met.

As shown in block 1230, for example, a determination is made using a mobile device (e.g., devices 106-1, 106-2, and the like) as to whether a user is driving. If not, a determination is then made as to whether the user is in focus, as shown in block 1232. A user is in focus if the device of the user is capable of receiving a communication. The communication can be an audio communication, a visual communication, or a combination of an audio communication and a visual communication. A determination as to whether a user is in focus can be performed by using the sensor components and software of a mobile device 106. In some embodiments, for example, device 106 may optionally include a motion sensor 128, such as a gyro-movement sensor or accelerometer that is arranged to sense that device 106 is in motion and/or is being accelerated or decelerated. In some embodiments, a camera or similar optical sensor can be used to determine whether a user is looking at the device 106. Similarly, audio sensors on device 106 can detect whether a user is present by listening for sounds from the user. Both the audio and visual sensor data can be processed in conjunction with the data relating to whether the device 106 is moving such that if a user is looking at the mobile device but driving, an appropriate script or action will follow.
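
One way to combine the sensor signals described above into a single classification is sketched below in Python, under the assumption that the platform exposes speed, gaze, and audio-presence readings; the field names and the driving threshold are illustrative:

    from dataclasses import dataclass

    @dataclass
    class SensorSnapshot:
        speed_mps: float        # e.g., from accelerometer/GPS fusion
        screen_attention: bool  # e.g., from a camera-based gaze check
        audio_presence: bool    # user sounds detected by the microphone

    DRIVING_SPEED_MPS = 4.0     # assumed threshold for "driving"

    def user_state(s: SensorSnapshot) -> str:
        """Classify the user for the branches at blocks 1230, 1232, and 1234."""
        if s.speed_mps > DRIVING_SPEED_MPS:
            return "driving"    # block 1230 "yes": end the process
        if s.screen_attention:
            return "in_focus"   # block 1232 "yes": play the tap instruction
        if s.audio_presence:
            return "speakable"  # block 1234 "yes": play the voice instruction
        return "not_reachable"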

If a determination is made at block 1230 that the user is driving, the call to action process 1200 ends (block 1254).

If a determination is made at block 1230 that the user is not driving and a determination is made at block 1232 that the user is in focus, then a definite script element is played, where the definite script element includes an instruction as to how the user of the device 106 should respond, as shown at block 1236. When a script element requests an action of a user via a device, such a script element is referred to as a call for action script element.

In this example, the user is instructed via a call for action script element to tap the device to listen to a media item (e.g., a podcast media item). The device is programmed to wait for a tap (e.g., a tap of a particular icon or simply a tap of the housing of the mobile device which is detected by a vibration sensor in the mobile device). If a determination has been made at block 1242 that the device has received a tap, the device 106 proceeds with taking an action. In this example, the action involves clicking through to an episode page, as shown at block 1244. Any number of now known or future known mechanisms for effecting an action upon receipt of user input (e.g., a tap) can be taken. If a determination is made at block 1242 that a user has not tapped the device within a predetermined amount of time (e.g., 30 seconds), then the process ends (block 1254).

If a determination is made at block 1232 that the user is not in focus, then a determination is made whether the device of the user is in a speakable state, as shown in block 1234. A speakable state is a state in which a user can verbalize a response via a device. If a determination is made at block 1234 that the user is in a speakable state, then a script element containing an instruction instructing the user to speak a certain utterance is played through device 106, as shown in block 1238. In the example shown in FIG. 12, script element 1238 is a fixed definite script element. A script element that provides an instruction can also be referred to as an instruction script element. Instruction script elements can be any combination of definite or possible and fixed or variable.

Upon playing the script element 1238, the dynamic call to action process 1200 causes the device 106 to receive a voice utterance as shown in block 1246. In an example implementation, the device 106 receives a voice utterance by turning on the microphone of the device 106, playing a microphone on tone, and turning on a visual listening indicator. Upon receiving an utterance via a microphone, a determination is made at block 1248 as to what the user said. This can be performed by now known or future developed natural language processing functions (e.g., voice recognition). What the user has uttered determines the next action. In the example shown there exist three types of actions: a first action, a second action and a third action. It should be understood that there could be more types of actions available.

In the example implementation illustrated in FIG. 12, if a determination has been made at block 1246 that the user said nothing for a predetermined amount of time, the process causes the device to perform a first action. In the example implementation, the first action is an action to play a microphone off tone (block 1252) and an action to end the call to action process (block 1254). If a determination has been made at block 1248 that the user spoke an expected utterance (e.g., “Save this”), the process causes the device to perform a second action as shown in block 1250. In the example implementation, the second action is for the device to play a sound indicating that receipt of the instructions was successful, play the microphone off tone (block 1252) and end the call to action process 1200 as shown in block 1254.
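
The three-way dispatch among blocks 1246 through 1256 can be sketched as follows; the expected phrase and the silence handling are illustrative assumptions:

    EXPECTED_UTTERANCES = {"save this"}   # illustrative expected phrase

    def choose_action(transcript):
        """Map a recognized utterance (or silence) to the three actions."""
        if transcript is None:             # user said nothing before the timeout
            return "first_action"          # mic-off tone, end process
        if transcript.strip().lower() in EXPECTED_UTTERANCES:
            return "second_action"         # success sound, mic-off tone, end
        return "third_action"              # error tone, repeat instruction script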

If a determination is made at block 1248 that the user uttered something else (e.g., an utterance that was not expected by the process), then the process causes the device to perform a third action. In this example, the third action is for the device to play an error tone as shown in block 1256 and then, for example, repeat a verbal script instructing the user to speak a certain utterance, as shown in block 1238. Optionally, another verbal script can be provided (not shown).

If a determination is made at block 1234 that the user is not in a speakable state, then at block 1240 the process causes a third script to be played through the device 106. In turn, the process causes the device to wait for a response, as shown in block 1260. In this example, the response that is expected is a double tap that is detected via a sensor (e.g., the accelerometer) of the device 106. If a determination is made at block 1260 that the device received the expected response (e.g., a double tap), then the process causes the device to perform a second action as shown in block 1250. In the example implementation, the second action is for the device to play a sound indicating that receipt of the instructions was successful (block 1250), play the microphone off tone (block 1252) and end the call to action process 1200 (block 1254).

In addition to or instead of an audio sound, haptic feedback can be initiated by the device 106.

If a determination is made at block 1260 that the user did not double tap within a predetermined time, then the call to action process 1200 ends (block 1254).

The voiceover length may vary as possible script elements get added or chosen from a set of possible variations. As such, the length or lengths of the background music that is mixed with the script elements may need to be modified. Background music that is mixed can be clipped or extended to accommodate this variable voiceover length in several ways.

In one example embodiment, the background music clips are arranged as loop-able segments. The number of loops can be selected, for example, based on voiceover length.

In another embodiment, the top n clips (where n is an integer) are ranked for different lengths (e.g., a clip for 30 s, a clip for 8 s). How the clips are ranked can vary (e.g., based on affinity, relevance, and the like). In some embodiments, the ranked list that is selected is based on voiceover length.

In another example embodiment, a background clip for the longest possible voiceover is selected and analyzed for possible earlier breakpoints if the voiceover is shorter. The analysis is performed using now known or future developed mechanisms for determining breakpoints.
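
The first and third strategies can be sketched together; the loop duration and breakpoint list are assumed inputs from a clip library and a breakpoint analyzer, respectively:

    import math

    def fit_background(voiceover_s, loop_s, breakpoints_s):
        """Adapt background length to a variable voiceover length."""
        # Strategy 1: repeat a loop-able segment enough times to cover the voice.
        n_loops = math.ceil(voiceover_s / loop_s)
        # Strategy 3: cut the longest clip at the earliest breakpoint that still
        # covers the voiceover.
        cut_at = next((b for b in sorted(breakpoints_s) if b >= voiceover_s), None)
        return {"n_loops": n_loops, "cut_at_s": cut_at}

    fit_background(voiceover_s=21.0, loop_s=8.0,
                   breakpoints_s=[8.0, 16.0, 24.0, 30.0])
    # -> {'n_loops': 3, 'cut_at_s': 24.0}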

The call to action processes described above with respect to FIGS. 11 and 12 can be performed by one or more processors. Particularly, when the methods described herein are executed by the one or more processors, the one or more processors perform the dynamic call to action processes. For convenience, the one or more processors that perform the dynamic call to action processes are called the call to action processor. The one or more processors described below in connection with a script processor can be the same as or different from those used in connection with the call to action processor. Accordingly, in some example embodiments, the call to action processor performs at least some of the procedures performed by the script processor. In some embodiments, the script processor performs at least some of the same procedures performed by the call to action processor.

Inserting Localized or Personalized Spots into Ads

FIG. 13 illustrates an example personalized spot, a generic spot and background music according to an example embodiment. The input to the system is a text script that includes “personalized fields”. The text that is not a part of a personalized field is referred to as the “generic spot”, and each personalized field as a “personalized spot”. The personalized spots are given as a list of values (e.g., a list of dates), and are generated either manually (e.g., for an artist's tour locations) or automatically (e.g., users' names, locations). The output of the system, the personalized audio advertisement, is delivered in real time by a media distribution server 112 to the end-user. These examples can be used as voiceover script elements.

In the example depicted in FIG. 13, the personalized spot that is generated is converted to a personalized voice file 1302. In turn, the personalized voice file 1302 is mixed with background music that has been saved as a background music file 1306.

Similarly, the generic spot that is generated is converted to a generic voice file 1304. In turn, the generic voice file 1304 is mixed with background music that has been saved as a background music file 1306.

In some embodiments, a script processor (not shown) is used to generate a script (or script section) based on, for example, input provided through a graphical user interface. In some embodiments, the script is generated by the script processor based on script sections received over a network.

The mechanism for mixing is described above in connection with FIG. 10. Referring to both FIGS. 10 and 13, the personalized voice file 1302 is the voice file 1004-2 and the background music file 1306 is the music file 1006-2.

Example Scripts

The following are some example scripts in accordance with some embodiments.

-   “Hey [user's name], enjoying listening to [artist]? We think you might also enjoy [related artist].”
-   “Hey [user's name]! You've listened to [artist] [number] times this month! As a way to say thanks, we'd like to offer you presale tickets to their show on [date] at [venue]. Click on the banner to unlock this offer.”
-   “There are only [number] more tickets left for [artist]'s show at [venue] on [date]! Click on the banner to get tickets for as low as [price]!”
-   [dynamic creatives in 3P ads]
-   Mobile gaming: “Oh no! Looks like you have [x] life left! Listen to [track name] for [y] more!”
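
Filling the bracketed personalized fields in such scripts amounts to simple template substitution. A minimal sketch, assuming the placeholder syntax shown above:

    import re

    def fill_personalized_fields(template, values):
        """Replace [field] placeholders with personalized values, leaving
        unknown fields untouched."""
        return re.sub(r"\[([^\]]+)\]",
                      lambda m: values.get(m.group(1), m.group(0)),
                      template)

    script = fill_personalized_fields(
        "Hey [user's name], enjoying listening to [artist]? "
        "We think you might also enjoy [related artist].",
        {"user's name": "Alex", "artist": "Saint Lucia",
         "related artist": "Joe Artist"},
    )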

Generic Spot Creation

The generic spot need only be created once. Ideally, a single voice actor (or virtual voice actor) will read through all portions of the generic script. For example, reading the script:

“[Hey user] Did you know that Saint Lucia is going to light up the stage with special guests Joe Artist? After opening for Patty Artist and Charlie Artist, Saint Lucia is ready to bring the dance party to [venue on date]. Tickets on sale now at www dot ticket seller dot com.”

In order to splice this generic spot with personalized spots, the audio is segmented. A text alignment system is utilized to find break points (i.e., where the [^] segments occur).
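
Once the alignment system yields break point times, the splice itself reduces to array surgery on the waveform. A sketch, assuming both recordings share a sample rate and the break point times come from the alignment step:

    import numpy as np

    def splice(generic, personalized, break_start_s, break_end_s, sr):
        """Replace the [^] span of the generic spot with a personalized segment."""
        a = int(break_start_s * sr)
        b = int(break_end_s * sr)
        return np.concatenate([generic[:a], personalized, generic[b:]])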

The voice actor for the generic spot could be, for example, a synthesized voice, an artist or a famous actor.

Personalized Spot Creation

When using a virtual voice actor to create audio segments, the profile of the virtual voice actor is chosen to most closely match the sound and style of the voice actor in the generic spot. To match profiles, timbre, pitch, and speaking contour descriptors are automatically extracted from the generic spot's voice over, and used to drive the parameters of the virtual voice actor. When using a human voice actor, if the list of personalized spots is small (e.g., <100), a single voice actor reads each of them in sequence: “. . . at Madison Square Garden in New York City on December 2nd . . . at the Shoreline Amphitheatre in Mountain View on December 8th . . . ”. The spots are then segmented using the text alignment system described in the previous section.
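
A hedged sketch of extracting the named descriptors with the librosa library follows; the summary statistics and the mapping from them to virtual-voice-actor parameters are assumptions for illustration:

    import librosa
    import numpy as np

    def voice_profile(path):
        """Summarize pitch and timbre of the generic spot's voice over."""
        y, sr = librosa.load(path, sr=None, mono=True)
        f0 = librosa.yin(y, fmin=65.0, fmax=400.0, sr=sr)         # pitch contour
        centroid = librosa.feature.spectral_centroid(y=y, sr=sr)  # timbre proxy
        return {
            "median_pitch_hz": float(np.nanmedian(f0)),
            "pitch_range_hz": float(np.nanpercentile(f0, 95)
                                    - np.nanpercentile(f0, 5)),
            "brightness_hz": float(np.mean(centroid)),
        }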

The voice actor is either the same as for the generic spot, instructed to match the sound and style of the generic spot's voice actor, or given custom instructions provided by the user.

Spot Segment Post-Processing

Each of the segments (generic and personalized) is automatically mastered and normalized (volume adjustments and silence removal) as described above in connection with FIG. 10.
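
The per-segment step can be approximated as silence trimming plus peak normalization. A minimal numpy sketch with assumed thresholds (the full processing chain is that of FIG. 10):

    import numpy as np

    def master_segment(y, target_peak=0.9, silence_db=-40.0):
        """Trim leading/trailing silence, then peak-normalize the segment."""
        peak = np.max(np.abs(y))
        if peak == 0:
            return y
        threshold = peak * 10.0 ** (silence_db / 20.0)
        voiced = np.nonzero(np.abs(y) > threshold)[0]
        y = y[voiced[0]:voiced[-1] + 1]               # silence removal
        return y * (target_peak / np.max(np.abs(y)))  # volume adjustment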

Delivering Ads with Personalized Spots

Targeting is passed through the ad system, and the corresponding personalized spots can be fetched using the metadata associated with the track.

For instance, the pre-generated track “Enjoying listening to <BandX>? We think you might also enjoy <BandY>” will have its metadata tagged with {“currentArtist”: “BandX”, “suggestedArtist”: “BandY”}.

When the ad server determines that the user is in the correct context to serve a promoted suggestion of {“currentArtist”: “BandX”, “suggestedArtist”: “BandY”}, then the pre-generated track will be fetched and served at that time. This example can be extended to additional personalization vectors.
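
Serving then reduces to a lookup keyed on the metadata tags; the store shape, key normalization, and file name below are assumptions:

    from typing import Optional

    # Hypothetical store of pre-generated tracks keyed by sorted metadata tags.
    PREGENERATED = {
        (("currentArtist", "BandX"),
         ("suggestedArtist", "BandY")): "track_0001.ogg",
    }

    def fetch_track(targeting: dict) -> Optional[str]:
        key = tuple(sorted(targeting.items()))
        return PREGENERATED.get(key)

    fetch_track({"currentArtist": "BandX", "suggestedArtist": "BandY"})
    # -> "track_0001.ogg"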

FIG. 14 illustrates a delivered audio file 1516 that has been created in real-time according to the example embodiments described herein. As shown in FIG. 14, the selected personalized spots 1504, 1508 are delivered with a set of start time, end time, and volume instructions. Similarly, the selected generic spots 1502, 1506 and 1510 are delivered with a set of start time, end time and volume instructions. Post-processed generic spots and personalized spots are merged using a short crossfade (as illustrated in the “gain” 1512-1, 1512-2, 1512-3, 1512-4, and 1512-5 and overlapping start/end time parameters) to ensure a seamless transition.
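
The short crossfade between consecutive spots can be sketched as a linear gain ramp over the overlapping region; the fade length is an assumed value, and each segment is assumed to be longer than the fade:

    import numpy as np

    def crossfade_concat(segments, sr, fade_s=0.02):
        """Merge spots with a short linear crossfade at each boundary."""
        n = int(fade_s * sr)
        out = np.asarray(segments[0], dtype=np.float64)
        ramp = np.linspace(0.0, 1.0, n)
        for seg in segments[1:]:
            seg = np.asarray(seg, dtype=np.float64)
            out[-n:] = out[-n:] * (1.0 - ramp) + seg[:n] * ramp  # gain crossfade
            out = np.concatenate([out, seg[n:]])
        return out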

The example embodiments described herein may be implemented using hardware, software or a combination thereof and may be implemented in one or more computer systems or other processing systems. However, the manipulations performed by these example embodiments are often referred to in terms, such as entering, which are commonly associated with mental operations performed by a human operator. No such capability of a human operator is necessary in any of the operations described herein. Rather, the operations may be completely implemented with machine operations. Useful machines for performing the operations of the example embodiments presented herein include general purpose digital computers or similar devices.

From a hardware standpoint, a CPU typically includes one or more components, such as one or more microprocessors, for performing the arithmetic and/or logical operations required for program execution, and storage media, such as one or more memory cards (e.g., flash memory) for program and data storage, and a random access memory, for temporary data and program instruction storage. From a software standpoint, a CPU typically includes software resident on a storage medium (e.g., a memory card), which, when executed, directs the CPU in performing transmission and reception functions. The CPU software may run on an operating system stored on the storage medium, such as, for example, UNIX or Windows, iOS, Linux, and the like, and can adhere to various protocols such as the Ethernet, ATM, TCP/IP protocols and/or other connection or connectionless protocols. As is well known in the art, CPUs can run different operating systems, and can contain different types of software, each type devoted to a different function, such as handling and managing data/information from a particular source, or transforming data/information from one format into another format. It should thus be clear that the embodiments described herein are not to be construed as being limited for use with any particular type of server computer, and that any other suitable type of device for facilitating the exchange and storage of information may be employed instead.

A CPU may be a single CPU, or may include plural separate CPUs, wherein each is dedicated to a separate application, such as, for example, a data application, a voice application, and a video application. Software embodiments of the example embodiments presented herein may be provided as a computer program product, or software, that may include an article of manufacture on a machine accessible or non-transitory computer-readable medium (i.e., also referred to as “machine readable medium”) having instructions. The instructions on the machine accessible or machine readable medium may be used to program a computer system or other electronic device. The machine-readable medium may include, but is not limited to, optical disks, CD-ROMs, and magneto-optical disks or other types of media/machine-readable media suitable for storing or transmitting electronic instructions. The techniques described herein are not limited to any particular software configuration. They may find applicability in any computing or processing environment. The terms “machine accessible medium”, “machine readable medium” and “computer-readable medium” used herein shall include any non-transitory medium that is capable of storing, encoding, or transmitting a sequence of instructions for execution by the machine (e.g., a CPU or other type of processing device) and that causes the machine to perform any one of the methods described herein. Furthermore, it is common in the art to speak of software, in one form or another (e.g., program, procedure, process, application, module, unit, logic, and so on) as taking an action or causing a result. Such expressions are merely a shorthand way of stating that the execution of the software by a processing system causes the processor to perform an action to produce a result.

Various operations and processes described herein can be performed by the cooperation of two or more devices, systems, processes, or combinations thereof.

While various example embodiments of the present invention have been described above, it should be understood that they have been presented by way of example, and not limitation. It will be apparent to persons skilled in the relevant art(s) that various changes in form and detail can be made therein. Thus, the present invention should not be limited by any of the above described example embodiments, but should be defined only in accordance with the following claims and their equivalents. Further, the Abstract is not intended to be limiting as to the scope of the example embodiments presented herein in any way. It is also to be understood that the procedures recited in the claims need not be performed in the order presented.

What is claimed is:
1. A computer-implemented method for voiceover mixing, comprising: receiving a voiceover file and a music file; audio processing a voiceover file to generate a processed voiceover file; audio processing a music file to generate a processed music file; weighted summing the processed voiceover file and the processed music file to generate a weighted combination of the processed voiceover file and the processed music file; single band compressing the weighted combination; and generating a creative file containing a compressed and weighted combination of the processed voiceover file and the processed music file.
2. The computer-implemented method for voiceover mixing according to claim 1, further comprising: measuring the energy level of the voice file within a frequency range; and filtering the frequency range if the energy level exceeds a predetermined threshold.
3. The computer-implemented method for voiceover mixing according to claim 1: wherein audio processing the voiceover file includes normalizing, compressing and equalizing the voiceover file; wherein audio processing the music file includes normalizing, compressing and equalizing the music file; and wherein the voiceover file and the music file are normalized, compressed and equalized asynchronously.
4. The computer-implemented method for voiceover mixing according to claim 1, further comprising: storing, in a voice activations store, a curve corresponding to when a voice is present in the voiceover file.
5. The computer-implemented method for voiceover mixing according to claim 1, further comprising: setting an advertisement duration time; setting a start time for the voiceover file; trimming the music file according to the advertisement duration time; and mixing the voiceover file and the music file according to the start time and the advertisement duration time.
6. The computer-implemented method for voiceover mixing according to claim 1, further comprising: generating a script; converting the script to voice content; and saving the voice content in the voiceover file.
7. The computer-implemented method for voiceover mixing according to claim 1, further comprising: mapping each track in a library of tracks to a point in an embedding space; computing an acoustic embedding based on a query track within the embedding space; obtaining a track from the library of tracks with acoustically similar content; and saving the track from the library of tracks with acoustically similar content in the music file.
8. A system for voiceover mixing, comprising: a voice processor operable to: receive a voiceover file, and generate a processed voiceover file from the voiceover file; a music processor operable to: receive a music file, and generate a processed music file from the music file; and a mixing processor operable to: weight sum the processed voiceover file and the processed music file to generate a weighted combination of the processed voiceover file and the processed music file, single band compress the weighted combination, and generate a creative file containing a compressed and weighted combination of the processed voiceover file and the processed music file.
9. The system for voiceover mixing according to claim 8, further comprising: the voice processor further operable to: measure the energy level of the voice file within a frequency range; and filter the frequency range if the energy level exceeds a predetermined threshold.
10. The system for voiceover mixing according to claim 8, the voice processor further operable to normalize, compress and equalize the voiceover file; and the music processor further operable to normalize, compress and equalize the music file, wherein the voiceover file and the music file are normalized, compressed and equalized asynchronously.
11. The system for voiceover mixing according to claim 8, further comprising: a voice activations store operable to store a curve corresponding to when a voice is present in the voiceover file.
12. The system for voiceover mixing according to claim 8, further comprising: an advertisement store operable to store an advertisement duration time; the voice processor further operable to set a start time for the voiceover file; the music processor further operable to trim the music file according to the advertisement duration time; and the mixing processor further operable to mix the voiceover file and the music file according to the start time and the advertisement duration time.
13. The system for voiceover mixing according to claim 8, further comprising: a script processor operable to generate a script from at least one script section; a text to voice processor operable to convert the script to voice content; and a voiceover store configured to save the voice content in the voiceover file.
14. The system for voiceover mixing according to claim 8, further comprising: a background music search processor operable to: map each track in a library of tracks to a point in an embedding space; compute an acoustic embedding based on a query track within the embedding space; obtain a track from the library of tracks with acoustically similar content; and save the track from the library of tracks with acoustically similar content in the music file.
15. A non-transitory computer-readable medium having stored thereon one or more sequences of instructions for causing one or more processors to perform: receiving a voiceover file and a music file; audio processing a voiceover file to generate a processed voiceover file; audio processing a music file to generate a processed music file; weighted summing the processed voiceover file and the processed music file to generate a weighted combination of the processed voiceover file and the processed music file; single band compressing the weighted combination; and generating a creative file containing a compressed and weighted combination of the processed voiceover file and the processed music file.
16. The computer-readable medium of claim 15, further having stored thereon a sequence of instructions for causing the one or more processors to perform: measuring the energy level of the voice file within a frequency range; and filtering the frequency range if the energy level exceeds a predetermined threshold.
17. The computer-readable medium of claim 15: wherein audio processing the voiceover file includes normalizing, compressing and equalizing the voiceover file; and wherein audio processing the music file includes normalizing, compressing and equalizing the music file, wherein the voiceover file and the music file are normalized, compressed and equalized asynchronously.
18. The computer-readable medium of claim 15, further having stored thereon a sequence of instructions for causing the one or more processors to perform: storing, in a voice activations store, a curve corresponding to when a voice is present in the voiceover file.
19. The computer-readable medium of claim 15, further having stored thereon a sequence of instructions for causing the one or more processors to perform: setting an advertisement duration time; setting a start time for the voiceover file; trimming the music file according to the advertisement duration time; and mixing the voiceover file and the music file according to the start time and the advertisement duration time.
20. The computer-readable medium of claim 15, further having stored thereon a sequence of instructions for causing the one or more processors to perform: generating a script; converting the script to voice content; and saving the voice content in the voiceover file.
21. The computer-readable medium of claim 15, further having stored thereon a sequence of instructions for causing the one or more processors to perform: mapping each track in a library of tracks to a point in an embedding space; computing an acoustic embedding based on a query track within the embedding space; obtaining a track from the library of tracks with acoustically similar content; and saving the track from the library of tracks with acoustically similar content in the music file.